Assignment03.ipynb
# Exploratory Data Analysis

- Preliminary step in data analysis to:
  - Summarize the main characteristics of the data
  - Gain a better understanding of the data set
  - Uncover relationships between variables
  - Extract important variables
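In pandas these steps boil down to a handful of calls; a minimal sketch on a toy DataFrame (the values below are illustrative, not taken from the Auto85 data):

```python
import pandas as pd

# Toy data standing in for a real data set
df = pd.DataFrame({
    "engine-size": [97, 109, 136, 152, 130],
    "Price": [5118, 13950, 17450, 16500, 13495],
})

print(df.describe())  # summarize main characteristics
print(df.dtypes)      # understand the data set (column types)
print(df.corr())      # uncover relationships between variables
```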
# Case Study

![image.png](attachment:b62bca63-335a-40a0-a3ab-39997a2cfd4e.png)

![image.png](attachment:969f4b4d-06a0-4ddc-9a89-63f837a22b77.png)

![image.png](attachment:f9788245-8f91-48bc-94d1-7be64f0984ca.png)

![image.png](attachment:155dc4a4-cfe0-4855-a581-7883ac4dd75c.png)

# Reading & Writing Data in Python

            [ ]:
            import pandas as pd
            import numpy as np
            [ ]:
            path = "Auto85.csv"
df = pd.read_csv(path, header = None)  # read_csv() assumes a header row by default; header=None overrides that
            [ ]:
            headers = ["symboling","normalized-losses","make","fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels","engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "Price"]
            [ ]:
            df.columns = headers
            [ ]:
            df.head()
# Exploratory Data Analysis

![image.png](attachment:a186219c-8b90-48bb-9d9e-1c9bfca0d238.png)

![image.png](attachment:50df215b-aa8a-4ddb-8f86-c6669faa34c2.png)

![image.png](attachment:26a5efd5-ddef-43f4-b6ea-571740171ac6.png)

# Descriptive Statistics

![image.png](attachment:387b3fb2-b4fe-4b57-a449-a4b80c3668ea.png)

            [ ]:
            df.describe()
## Summarize the categorical data by using _value_counts()_

            [ ]:
            drive_wheels_counts = df["drive-wheels"].value_counts()
            drive_wheels_counts
            [ ]:
drive_wheels_counts = drive_wheels_counts.rename("value_counts")  # set the Series name (a dict here would rename index labels, not the Series)
drive_wheels_counts.index.name = "drive-wheels"
drive_wheels_counts
![image.png](attachment:baf5d9ce-b16e-481f-a95c-cb11445c931e.png)

            [ ]:
            df["Price"].replace("?", np.nan, inplace = True)
            df["Price"] = pd.to_numeric(df["Price"])
            [ ]:
            import seaborn as sns
            sns.boxplot(x="drive-wheels", y = "Price", data = df)
![image.png](attachment:4a4f21ac-4698-47a2-8d1a-f74b1da8d182.png)

            [ ]:
            import matplotlib.pyplot as plt
            plt.scatter(df["engine-size"], df["Price"])
            plt.title("Relationship b/w Engine Size and Price")
            plt.xlabel("Engine Size")
            plt.ylabel("Price")
# GroupBy in Python

![image.png](attachment:3dc580a8-395e-4e67-9a79-e53917d69309.png)

            [ ]:
            df
            [ ]:
            df_test = df[ ["drive-wheels","body-style", "Price"] ]
            [ ]:
            df_test
            [ ]:
            df_grp = df_test.groupby( [ "drive-wheels","body-style"], as_index=False).mean()
            df_grp
            [ ]:
            df_pivot = df_grp.pivot(index = "drive-wheels", columns = "body-style")
            df_pivot
            [ ]:
            import matplotlib.pyplot as plt
            plt.pcolor(df_pivot, cmap = "BuGn_r")
            plt.colorbar()
            plt.xlabel("Body Style")
            plt.ylabel("Drive Wheel")
            plt.show()
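plt.pcolor draws the pivot table with bare numeric ticks; the row and column labels can be attached by centering one tick on each cell. A sketch, using a small stand-in for df_pivot (the values are illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Small stand-in for df_pivot
df_pivot = pd.DataFrame(
    [[np.nan, 7603.0], [11595.0, 8396.4]],
    index=["4wd", "fwd"],
    columns=["convertible", "hatchback"],
)

fig, ax = plt.subplots()
mesh = ax.pcolor(df_pivot, cmap="BuGn_r")
fig.colorbar(mesh)
# Center one tick on each cell, then label with the pivot's columns/index
ax.set_xticks(np.arange(df_pivot.shape[1]) + 0.5)
ax.set_xticklabels(df_pivot.columns)
ax.set_yticks(np.arange(df_pivot.shape[0]) + 0.5)
ax.set_yticklabels(df_pivot.index)
ax.set_xlabel("Body Style")
ax.set_ylabel("Drive Wheel")
```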
# Correlation

![image.png](attachment:9890fff5-d532-408d-8d74-a666944db08f.png)

![image.png](attachment:a7e20f36-4419-4994-9601-82726e44860b.png)

            [ ]:
            import seaborn as sns
            sns.regplot(x="engine-size", y = "Price", data=df)
            plt.ylim(0,)
            plt.xlabel("Engine Size")
            plt.title("Correlation b/w Engine Size and Price")
            [ ]:
            import seaborn as sns
            import matplotlib.pyplot as plt
            sns.regplot(x="highway-mpg", y="Price", data=df)
            plt.ylim(0,)
            plt.xlabel("highway-mpg")
            plt.title("Negative Correlation b/w highway-mpg and Price")
            [ ]:
            """ Find the no of missing values in peak-rpm column"""
            df["peak-rpm"].isnull().sum()
            [ ]:
            """ Convert the data type of peak-rpm from object to float """
            df["peak-rpm"].replace("?", np.nan, inplace = True)
            df["peak-rpm"] = pd.to_numeric(df["peak-rpm"])
            [ ]:
            import seaborn as sns
            import matplotlib.pyplot as plt
            sns.regplot(x="peak-rpm", y="Price", data=df)
            plt.ylim(0,)
            plt.xlabel("Peak RPM")
            plt.title("Weak Correlation b/w Peak RPM and Price")
# Correlation Statistics

![image.png](attachment:92a2e3b1-35b4-4429-86bf-39d1f834ffc0.png)

![image.png](attachment:55ff6735-9806-4def-8956-bad0f55f9f88.png)

![image.png](attachment:ec5538d3-e1be-44e1-a401-9aa252efb2f5.png)

            [ ]:
            """ Convert the data type of horsepower from object to float """
            df["horsepower"].replace("?", 0, inplace = True)
            df["horsepower"] = pd.to_numeric(df["horsepower"])
            df["horsepower"].fillna(0, inplace = True)
            df["Price"].fillna(0, inplace = True)
            [ ]:
            from scipy.stats import pearsonr
            pearson_coef, p_val = pearsonr(df["horsepower"], df["Price"])
            [ ]:
            pearson_coef
            [ ]:
            p_val
              [2]:
              import pandas as pd
              pd.options.mode.chained_assignment = None # default='warn'
              import numpy as np
              [3]:
              path = "C:/Users/SINDH/Downloads/09_Exploratory_Data_Analysis/09_Exploratory_Data_Analysis/Auto85.csv"
df = pd.read_csv(path, header = None)  # read_csv() assumes a header row by default; header=None overrides that
              [4]:
              df.head()
              [4]:
              0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
              0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
              1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
              2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
              3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
              4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

              5 rows × 26 columns

              [5]:
              headers = ["symboling","normalized-losses","make","fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels","engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "Price"]
              df.columns = headers
              [6]:
              df["Price"].replace("?", np.nan, inplace = True)
              df["Price"] = pd.to_numeric(df["Price"])
              C:\Users\SINDH\AppData\Local\Temp\ipykernel_12840\2782822628.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
              The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
              
              For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
              
              
                df["Price"].replace("?", np.nan, inplace = True)
              
              [7]:
              df["Price"].dtypes
              [7]:
              dtype('float64')
              [8]:
              import seaborn as sns
              [9]:
              sns.boxplot(x="drive-wheels", y = "Price", data = df)
              [9]:
              <Axes: xlabel='drive-wheels', ylabel='Price'>
              [10]:
              import matplotlib.pyplot as plt
              plt.scatter(df["engine-size"], df["Price"])
              plt.title("Relationship b/w Engine Size and Price")
              plt.xlabel("Engine Size")
              plt.ylabel("Price")
              [10]:
              Text(0, 0.5, 'Price')
              [11]:
              df_test = df[ ["drive-wheels","body-style", "Price"] ]
              [12]:
              df_test
              [12]:
              drive-wheels body-style Price
              0 rwd convertible 13495.0
              1 rwd convertible 16500.0
              2 rwd hatchback 16500.0
              3 fwd sedan 13950.0
              4 4wd sedan 17450.0
              ... ... ... ...
              200 rwd sedan 16845.0
              201 rwd sedan 19045.0
              202 rwd sedan 21485.0
              203 rwd sedan 22470.0
              204 rwd sedan 22625.0

              205 rows × 3 columns

              [13]:
              df_grp = df_test.groupby( [ "drive-wheels","body-style"], as_index=False).mean()
              df_grp
              [13]:
              drive-wheels body-style Price
              0 4wd hatchback 7603.000000
              1 4wd sedan 12647.333333
              2 4wd wagon 9095.750000
              3 fwd convertible 11595.000000
              4 fwd hardtop 8249.000000
              5 fwd hatchback 8396.387755
              6 fwd sedan 9811.800000
              7 fwd wagon 9997.333333
              8 rwd convertible 23949.600000
              9 rwd hardtop 24202.714286
              10 rwd hatchback 14337.777778
              11 rwd sedan 21711.833333
              12 rwd wagon 16994.222222
              [14]:
              df_pivot = df_grp.pivot(index = "drive-wheels", columns = "body-style")
              df_pivot
              [14]:
              Price
              body-style convertible hardtop hatchback sedan wagon
              drive-wheels
              4wd NaN NaN 7603.000000 12647.333333 9095.750000
              fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333
              rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222
              [15]:
              import matplotlib.pyplot as plt
              plt.pcolor(df_pivot, cmap = "BuGn")
              plt.colorbar()
              plt.xlabel("Body Style")
              plt.ylabel("Drive Wheel")
              plt.show()
              C:\Users\SINDH\AppData\Local\Temp\ipykernel_12840\878192950.py:3: MatplotlibDeprecationWarning: Getting the array from a PolyQuadMesh will return the full array in the future (uncompressed). To get this behavior now set the PolyQuadMesh with a 2D array .set_array(data2d).
                plt.colorbar()
              
              [16]:
              df_test = df[ ["drive-wheels","body-style", "Price"] ]
              [17]:
              df_test
              [17]:
              drive-wheels body-style Price
              0 rwd convertible 13495.0
              1 rwd convertible 16500.0
              2 rwd hatchback 16500.0
              3 fwd sedan 13950.0
              4 4wd sedan 17450.0
              ... ... ... ...
              200 rwd sedan 16845.0
              201 rwd sedan 19045.0
              202 rwd sedan 21485.0
              203 rwd sedan 22470.0
              204 rwd sedan 22625.0

              205 rows × 3 columns

              [18]:
              df_grp = df_test.groupby( [ "drive-wheels","body-style"], as_index=False).mean()
              df_grp
              [18]:
              drive-wheels body-style Price
              0 4wd hatchback 7603.000000
              1 4wd sedan 12647.333333
              2 4wd wagon 9095.750000
              3 fwd convertible 11595.000000
              4 fwd hardtop 8249.000000
              5 fwd hatchback 8396.387755
              6 fwd sedan 9811.800000
              7 fwd wagon 9997.333333
              8 rwd convertible 23949.600000
              9 rwd hardtop 24202.714286
              10 rwd hatchback 14337.777778
              11 rwd sedan 21711.833333
              12 rwd wagon 16994.222222
              [19]:
              df_pivot = df_grp.pivot(index = "drive-wheels", columns = "body-style")
              df_pivot
              [19]:
              Price
              body-style convertible hardtop hatchback sedan wagon
              drive-wheels
              4wd NaN NaN 7603.000000 12647.333333 9095.750000
              fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333
              rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222
              [20]:
              import matplotlib.pyplot as plt
              plt.pcolor(df_pivot, cmap = "BuGn_r")
              plt.colorbar()
              plt.xlabel("Body Style")
              plt.ylabel("Drive Wheel")
              plt.show()
              [21]:
              import seaborn as sns
              sns.regplot(x="engine-size", y = "Price", data=df)
              plt.ylim(0,)
              plt.xlabel("Engine Size")
              plt.title("Correlation b/w Engine Size and Price")
              [21]:
              Text(0.5, 1.0, 'Correlation b/w Engine Size and Price')
              [22]:
              import seaborn as sns
              import matplotlib.pyplot as plt
              sns.regplot(x="highway-mpg", y="Price", data=df)
              plt.ylim(0,)
              plt.xlabel("highway-mpg")
              plt.title("Negative Correlation b/w highway-mpg and Price")
              [22]:
              Text(0.5, 1.0, 'Negative Correlation b/w highway-mpg and Price')
              [23]:
              """ Convert the data type of horsepower from object to float """
              df["horsepower"].replace("?", 0, inplace = True)
              df["horsepower"] = pd.to_numeric(df["horsepower"])
              df["horsepower"].fillna(0, inplace = True)
              df["Price"].fillna(0, inplace = True)
              
              [25]:
              from scipy.stats import pearsonr
              pearson_coef, p_val = pearsonr(df["horsepower"], df["Price"])
              [26]:
              pearson_coef
              [26]:
              0.6912878787942788
              [27]:
              p_val
              [27]:
              1.8175735366187578e-30
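Reading these numbers: r ≈ 0.69 indicates a moderately strong positive linear relationship between horsepower and Price, and a p-value near 1.8e-30 (far below 0.05) means the correlation is statistically significant. The same coefficient can be cross-checked with NumPy alone; a sketch on toy, roughly linear data (not the Auto85 values):

```python
import numpy as np

x = np.array([48.0, 60.0, 68.0, 88.0, 111.0])             # horsepower-like values
y = np.array([5118.0, 7775.0, 9095.0, 12964.0, 17450.0])  # price-like values

# Pearson r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]
print(r)  # close to +1: strong positive linear relationship
```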
# Model Development

- Linear Regression
- Prediction
- Model Evaluation
- Model Evaluation using Visualization
- Polynomial Regression
## What is a Model?

![image.png](attachment:8122d3ce-a3bd-4ba2-aa8f-079969543509.png)

![image.png](attachment:ae1d8667-a0d3-40f7-ad48-7fcee469a522.png)

![image.png](attachment:291927b9-c410-49fe-8f24-c113b726d188.png)

![image.png](attachment:b85d8571-0a0a-4727-8b82-3a0047a75605.png)

## Linear Regression

![image.png](attachment:9e74a31c-7a39-48fb-b1f1-e59d3e8206c9.png)

![image.png](attachment:ba9baa63-cc84-42ee-a7a7-b350c3d01c26.png)

![image.png](attachment:11b7dcea-ff69-4912-9f00-63bbf34a128d.png)

![image.png](attachment:94b89d2a-c1a2-4fed-8c31-ee3d4d100a57.png)

![image.png](attachment:49b994bd-cdf1-42d8-b6af-3e29d87c84dd.png)

![image.png](attachment:7f0af607-9a98-42a4-b34e-9c2c89311ac7.png)

![image.png](attachment:5b084510-8f71-4f95-ac8e-089a4c25b642.png)

![image.png](attachment:9c3f6a9b-7142-438d-87bd-3a5786bdf187.png)

![image.png](attachment:13d2e963-a6df-4a3d-ac95-a267e3790706.png)

![image.png](attachment:775f3e3d-fadf-4c95-a915-97b6cac03b22.png)

### Predict car price based on highway-mpg

                [1]:
                import pandas as pd
                import numpy as np
                [2]:
                path = "C:/Users/SINDH/Downloads/09_Exploratory_Data_Analysis/09_Exploratory_Data_Analysis/Auto85.csv"
df = pd.read_csv(path, header = None)  # read_csv() assumes a header row by default; header=None overrides that
                [3]:
                headers = ["symboling","normalized-losses","make","fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels","engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "Price"]
                df.columns = headers
### Pre-Processing

- Check the highway-mpg and Price columns:
  - They should be numeric
  - They should not contain any missing data
                [4]:
                # Consider highway-mpg for Price prediction
                df["highway-mpg"]
                [4]:
                0      27
                1      27
                2      26
                3      30
                4      22
                       ..
                200    28
                201    25
                202    23
                203    27
                204    25
                Name: highway-mpg, Length: 205, dtype: int64
                [5]:
                # Check for missing values
                df["highway-mpg"].isnull().sum()
                [5]:
                0
                [6]:
# Check the data type of the Price column
                df['Price'].dtype
                [6]:
                dtype('O')
                [7]:
                # Convert Price column to numeric
                df["Price"].replace("?", np.nan, inplace = True)
                df["Price"] = pd.to_numeric(df["Price"])
                df["Price"].isnull().sum()
                
                [7]:
                4
                [8]:
                df.dropna(subset=["Price"], axis=0, inplace = True)
                df["Price"].isnull().sum()
                [8]:
                0
                [9]:
                df[["highway-mpg", "Price"] ]
                [9]:
                highway-mpg Price
                0 27 13495.0
                1 27 16500.0
                2 26 16500.0
                3 30 13950.0
                4 22 17450.0
                ... ... ...
                200 28 16845.0
                201 25 19045.0
                202 23 21485.0
                203 27 22470.0
                204 25 22625.0

                201 rows × 2 columns

### Use the scikit-learn Library for Linear Regression

                [10]:
                # Import Linear Model from scikit-learn
                from sklearn.linear_model import LinearRegression
                [11]:
                # Create a Linear Regression Object
                linear_model = LinearRegression()
                [12]:
                # Define X as feature set, Y as target variable
                X = df[ ["highway-mpg"] ]
                Y = df ["Price"]
                [13]:
                Y
                [13]:
                0      13495.0
                1      16500.0
                2      16500.0
                3      13950.0
                4      17450.0
                        ...   
                200    16845.0
                201    19045.0
                202    21485.0
                203    22470.0
                204    22625.0
                Name: Price, Length: 201, dtype: float64
                [14]:
                # Model fit
                linear_model.fit(X,Y)
                [14]:
LinearRegression()
                [15]:
                print("c_0 = ", linear_model.intercept_)
                print("c_1 = ", linear_model.coef_)
                c_0 =  38423.3058581574
                c_1 =  [-821.73337832]
                
$Predicted\_Price = c_0 + c_1 \cdot highway\_mpg$ <br>
$c_0 = 38423.305$ <br>
$c_1 = -821.73$ <br>
$Predicted\_Price = 38423.305 - 821.73 \cdot highway\_mpg$

                [16]:
                hmpg = 15

                linear_model.predict( np.array( [[hmpg]] ) )
                C:\Users\SINDH\AppData\Roaming\Python\Python312\site-packages\sklearn\base.py:493: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
                  warnings.warn(
                
                [16]:
                array([26097.30518333])
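The UserWarning above appears because the model was fitted on a DataFrame (so it remembers the feature name "highway-mpg") while predict() received a bare NumPy array. Passing a one-row DataFrame with the same column name avoids it; a sketch with illustrative data (the fitted coefficients here are not the Auto85 ones):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative fit; real code would use df[["highway-mpg"]] and df["Price"]
X = pd.DataFrame({"highway-mpg": [22, 25, 27, 30]})
Y = pd.Series([17450.0, 19045.0, 13495.0, 13950.0])
model = LinearRegression().fit(X, Y)

# Predict with a DataFrame carrying the feature name -> no UserWarning
new_point = pd.DataFrame({"highway-mpg": [15]})
pred = model.predict(new_point)
print(pred)
```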


## Multivariable Linear Regression

- Predict the price of a car based on horsepower, curb-weight, engine-size, highway-mpg
- Make sure to check the types of all columns
                [17]:
                df[["horsepower", "curb-weight", "engine-size", "highway-mpg"]].dtypes
                [17]:
                horsepower     object
                curb-weight     int64
                engine-size     int64
                highway-mpg     int64
                dtype: object
                [18]:
# Convert the horsepower column to numeric
                df["horsepower"].replace("?", np.nan, inplace = True)
                df["horsepower"] = pd.to_numeric(df["horsepower"])
                df["horsepower"].isnull().sum()
                
                [18]:
                2
                [19]:
# Replace missing values with the column mean
                ave_hp = df["horsepower"].mean()
                df["horsepower"].replace(np.nan, ave_hp, inplace = True)
                df["horsepower"].isnull().sum()
                
                [19]:
                0
                [20]:
                # Check for NaN or missing values for other attributes
                print(df["curb-weight"].isnull().sum())
                print(df["engine-size"].isnull().sum())
                print(df["highway-mpg"].isnull().sum())
                print(df["Price"].isnull().sum())
                0
                0
                0
                0
                
                [21]:
                X = df[["horsepower", "curb-weight", "engine-size", "highway-mpg"]]
                Y = df["Price"]
                linear_model.fit(X, Y)
                [21]:
LinearRegression()
                [22]:
                print("c_0 = ", linear_model.intercept_)
                print("c_{1-4} = ", linear_model.coef_)
                c_0 =  -15824.038208234477
                c_{1-4} =  [53.61042729  4.70886444 81.47225667 36.39637823]
                
                $ Y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + c_4 x_4 $ <br>
                $ Predicted\_Price = c_0 + c_1(horsepower) + c_2(curb\_weight) + c_3(engine\_size) + c_4(highway\_mpg) $ <br>
                $c_0 = -15824.038$ <br>
                $c_1 = 53.61$ <br>
                $c_2 = 4.7$ <br>
                $c_3 = 81.47 $ <br>
                $c_4 = 36.39$ <br>
                $ Predicted\_Price = -15824.038 + 53.61(horsepower) + 4.7(curb\_weight) + 81.47( engine\_size) + 36.39(highway\_mpg) $

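The fitted equation above can be evaluated directly. A minimal sketch using the rounded coefficients printed above (the function name is just for illustration):

```python
# Rounded coefficients taken from the fitted model's output above
c0 = -15824.038
c = [53.61, 4.7, 81.47, 36.39]

def predicted_price(horsepower, curb_weight, engine_size, highway_mpg):
    """Evaluate Y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 by hand."""
    features = [horsepower, curb_weight, engine_size, highway_mpg]
    return c0 + sum(ci * xi for ci, xi in zip(c, features))

# With all features at zero, the prediction is just the intercept
print(predicted_price(0, 0, 0, 0))  # → -15824.038
```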

                [23]:
linear_model.predict(pd.DataFrame([[400, 24, 1000, 12]], columns = X.columns))
                
                [23]:
                array([87642.15866469])
                ## Model Evaluation
                - Mean Squared Error (MSE)
                - R Squared Error

![image.png](attachment:b57b8b39-6c6a-4c4f-bf9a-a62b8e5a55c6.png)

![image.png](attachment:b43f78e8-2db8-47b3-b307-f2cedc48f092.png)

![image.png](attachment:6033f2e4-3ace-4853-b1de-47e048dec544.png)

![image.png](attachment:11d7da9b-0669-42a9-859a-09b64269a07c.png)

                [24]:
                from sklearn.metrics import mean_squared_error
                [25]:
                # Mean Squared Error (MSE)
                Y_hat = linear_model.predict(X)
                mean_squared_error(df["Price"], Y_hat)
                [25]:
                11976801.681229591
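Both metrics can also be computed from their definitions. A minimal numpy sketch on illustrative toy values (not the notebook's data):

```python
import numpy as np

# Toy targets and predictions (illustrative values only)
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

# MSE: mean of the squared residuals
mse = np.mean((y - y_hat) ** 2)

# R^2: 1 - SS_res / SS_tot
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse)  # → 0.25
print(r2)   # 0.95 up to float rounding
```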
                ## Model Evaluation using Visualization

![image.png](attachment:08c2fc2d-d15e-43b3-80a6-9540fddc9338.png)

![image.png](attachment:71adaad3-64fb-4303-be3f-ae10e91cc3df.png)

                [ ]:
                import seaborn as sns
                import matplotlib.pyplot as plt
                [ ]:
                sns.regplot(x = "highway-mpg", y = "Price", data = df)
                plt.ylim(0,)
![image.png](attachment:aad81f01-4dfb-47f8-907b-6d4aada13f74.png)

                [ ]:
                sns.residplot(x= "engine-size", y = "Price", data = df)
If the residual values are randomly spread around the x-axis, a linear model is appropriate for the data.


![image.png](attachment:c12e52c1-d7d5-434c-8164-09051e5d5094.png)

![image.png](attachment:021c5f0a-c5dc-439b-b42a-8060dfa1b425.png)
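The residual check described above can be sketched numerically; the observations and the fitted line below are hypothetical:

```python
import numpy as np

# Hypothetical fitted line y_hat = 2x + 1 against noisy observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.2, 4.9, 7.1, 8.8, 11.0])
y_hat = 2 * x + 1

# Residuals hover around zero with no visible trend,
# which is the pattern residplot would show for a good linear fit
residuals = y - y_hat
print(residuals)
```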

                [ ]:

                  [ ]:

                  [ ]:

# K-Means Clustering
                  ## Problem Statement

- The following features are available for California houses in a specific locality, obtained from 1990 census data:
  - Longitude
  - Latitude
  - Housing Median Age
  - Total Rooms
  - Total Bedrooms
  - Population
  - Households
  - Median Income
  - Median House Value
  - Ocean Proximity

- Create clusters/groups of houses based on a selected set of features.


                  ## Acknowledgement / Source

- Data
  - https://www.kaggle.com/datasets/camnugent/california-housing-prices/
- Code
  - https://www.datacamp.com/tutorial/k-means-clustering-python

                  ## Importing Libraries


                  [ ]:
                  import pandas as pd
                  import seaborn as sns

                  from sklearn.model_selection import train_test_split
                  from sklearn import preprocessing
                  from sklearn.cluster import KMeans
                  from sklearn.metrics import silhouette_score
                  ## Loading the Dataset


                  [ ]:
                  home_data = pd.read_csv('Data/CaliforniaHousingPrices.csv')
                  home_data.head()
                  [ ]:
# Select only 3 features for this case study: longitude, latitude and median house value

                  home_data = home_data[['longitude', 'latitude', 'median_house_value']]

                  home_data.head()
                  [ ]:
                  home_data.shape
                  ## Visualize the Data


                  [ ]:
                  # 'median_house_value' column is used to color-code the data points
                  sns.scatterplot(data = home_data, x = 'longitude', y = 'latitude', hue = 'median_house_value')
                  ## Pre-Processing


                  [ ]:
                  #from sklearn.model_selection import train_test_split

                  #X_train, X_test, y_train, y_test = train_test_split(home_data[['latitude', 'longitude']], home_data[['median_house_value']], test_size=0.33, random_state=0)
                  [ ]:
                  #from sklearn import preprocessing

                  X = home_data[['latitude', 'longitude']]

                  X_norm = preprocessing.normalize(X)
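By default, `preprocessing.normalize` rescales each row (each sample) to unit L2 norm, not each column. A minimal numpy sketch of the same operation:

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [6.0, 8.0]])

# Row-wise L2 normalization: divide each sample by its Euclidean length
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_norm = X / norms

print(X_norm)                           # both rows become [0.6, 0.8]
print(np.linalg.norm(X_norm, axis=1))   # each row now has unit length
```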
                  ## Model


                  [ ]:
                  #from sklearn.cluster import KMeans

                  kmeans = KMeans(n_clusters = 3, random_state = 0, n_init='auto')
                  kmeans.fit(X_norm)
                  [ ]:
                  sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = kmeans.labels_)
                  [ ]:
                  house_values = home_data['median_house_value']
                  sns.boxplot(x = kmeans.labels_, y = house_values)
                  ### Silhouette (si·loo·**et**) Score

                  - **Scores closer to 1**: Indicate well-separated clusters, suggesting the clustering is likely effective in capturing the underlying structure in the data.
                  - **Scores around 0**: Indicate clusters with some overlap, and you might consider adjusting the number of clusters or the clustering algorithm to see if you can achieve better separation.
                  - **Negative scores**: Suggest that some data points are potentially assigned to the wrong cluster, and you might need to explore alternative clustering strategies.

                  [ ]:
                  #from sklearn.metrics import silhouette_score
                  silhouette_score(X_norm, kmeans.labels_, metric='euclidean')
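The silhouette score can also be computed from its definition: for each point, `a` is its mean distance to points in the same cluster, `b` is the lowest mean distance to any other cluster, and `s = (b - a) / max(a, b)`. A from-scratch sketch on a toy two-cluster dataset:

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette score computed directly from its definition."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i in range(len(points)):
        d = np.linalg.norm(points - points[i], axis=1)
        same = labels == labels[i]
        other = np.arange(len(points)) != i
        a = d[same & other].mean()               # mean intra-cluster distance
        b = min(d[labels == c].mean()            # nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated toy clusters → score close to 1
pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
lbl = [0, 0, 1, 1]
print(silhouette(pts, lbl))
```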
                  ## Choosing the Number of Clusters


                  [ ]:
                  K = range(2, 8)
                  fits = []
                  score = []


for k in K:
    # train the model for the current value of k
    model = KMeans(n_clusters = k, random_state = 0, n_init='auto').fit(X_norm)
    # append the fitted model to fits
    fits.append(model)
    # append the silhouette score to score
    score.append(silhouette_score(X_norm, model.labels_, metric='euclidean'))
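One common rule of thumb is to keep the k whose model scored the highest silhouette. A sketch with purely illustrative scores (not the values this notebook produces):

```python
# Hypothetical silhouette scores for k = 2..7 (illustrative numbers only)
K = range(2, 8)
score = [0.75, 0.71, 0.69, 0.64, 0.60, 0.58]

# Pick the k whose clustering had the highest silhouette score
best_k = list(K)[score.index(max(score))]
print(best_k)  # → 2
```

A high silhouette alone is not decisive; the notebook also compares cluster scatterplots and box plots before settling on a k.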
                  [ ]:
                  sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[0].labels_)
                  [ ]:
                  sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[2].labels_)
                  [ ]:
                  sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[5].labels_)
                  [ ]:
                  sns.lineplot(x = K, y = score)
                  [ ]:
                  sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[3].labels_)
                  [ ]:
                  sns.boxplot(x = fits[3].labels_, y = house_values)
                  [ ]:

                    [1]:
                    import pandas as pd
                    import seaborn as sns

                    from sklearn.model_selection import train_test_split
                    from sklearn import preprocessing
                    from sklearn.cluster import KMeans
                    from sklearn.metrics import silhouette_score
                    [2]:
                    path = "C:/Users/SINDH/Downloads/CaliforniaHousingPrices.csv"
                    home_data = pd.read_csv(path) #read_csv() assumes data has a header
                    [4]:
                    home_data.head()
                    [4]:
                    longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
                    0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
                    1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
                    2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
                    3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
                    4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
                    [5]:
                    home_data = home_data[['longitude', 'latitude', 'median_house_value']]

                    home_data.head()
                    [5]:
                    longitude latitude median_house_value
                    0 -122.23 37.88 452600.0
                    1 -122.22 37.86 358500.0
                    2 -122.24 37.85 352100.0
                    3 -122.25 37.85 341300.0
                    4 -122.25 37.85 342200.0
                    [6]:
                    home_data.shape
                    [6]:
                    (20640, 3)
                    [7]:
                    # 'median_house_value' column is used to color-code the data points
                    sns.scatterplot(data = home_data, x = 'longitude', y = 'latitude', hue = 'median_house_value')
                    [7]:
                    <Axes: xlabel='longitude', ylabel='latitude'>
                    [8]:
                    X = home_data[['latitude', 'longitude']]

                    X_norm = preprocessing.normalize(X)
                    [9]:
                    kmeans = KMeans(n_clusters = 3, random_state = 0, n_init='auto')
                    kmeans.fit(X_norm)
                    [9]:
KMeans(n_clusters=3, random_state=0)
                    [10]:
                    sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = kmeans.labels_)
                    [10]:
                    <Axes: xlabel='longitude', ylabel='latitude'>
                    [11]:
                    house_values = home_data['median_house_value']
                    sns.boxplot(x = kmeans.labels_, y = house_values)
                    [11]:
                    <Axes: ylabel='median_house_value'>
                    [12]:
                    silhouette_score(X_norm, kmeans.labels_, metric='euclidean')
                    [12]:
                    0.7499115323584772
                    [13]:
                    K = range(2, 8)
                    fits = []
                    score = []


for k in K:
    # train the model for the current value of k
    model = KMeans(n_clusters = k, random_state = 0, n_init='auto').fit(X_norm)
    # append the fitted model to fits
    fits.append(model)
    # append the silhouette score to score
    score.append(silhouette_score(X_norm, model.labels_, metric='euclidean'))
                    [14]:
                    sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[0].labels_)
                    [14]:
                    <Axes: xlabel='longitude', ylabel='latitude'>
                    [15]:
                    sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[2].labels_)
                    [15]:
                    <Axes: xlabel='longitude', ylabel='latitude'>
                    [16]:
                    sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[5].labels_)
                    [16]:
                    <Axes: xlabel='longitude', ylabel='latitude'>
                    [17]:
                    sns.lineplot(x = K, y = score)
                    [17]:
                    <Axes: >
                    [18]:
                    sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[3].labels_)
                    [18]:
                    <Axes: xlabel='longitude', ylabel='latitude'>
                    [19]:
                    sns.boxplot(x = fits[3].labels_, y = house_values)
                    [19]:
                    <Axes: ylabel='median_house_value'>
                    [20]:
                    path = "C:/Users/SINDH/Downloads/CaliforniaHousingPrices.csv"
                    home_data = pd.read_csv(path) #read_csv() assumes data has a header
                    [21]:
selected_features = home_data[ ["housing_median_age", "total_rooms", "total_bedrooms", "population"] ].copy()

                    selected_features.describe()
                    [21]:
                    housing_median_age total_rooms total_bedrooms population
                    count 20640.000000 20640.000000 20433.000000 20640.000000
                    mean 28.639486 2635.763081 537.870553 1425.476744
                    std 12.585558 2181.615252 421.385070 1132.462122
                    min 1.000000 2.000000 1.000000 3.000000
                    25% 18.000000 1447.750000 296.000000 787.000000
                    50% 29.000000 2127.000000 435.000000 1166.000000
                    75% 37.000000 3148.000000 647.000000 1725.000000
                    max 52.000000 39320.000000 6445.000000 35682.000000
                    [22]:
                    selected_features.head(10)
                    [22]:
                    housing_median_age total_rooms total_bedrooms population
                    0 41.0 880.0 129.0 322.0
                    1 21.0 7099.0 1106.0 2401.0
                    2 52.0 1467.0 190.0 496.0
                    3 52.0 1274.0 235.0 558.0
                    4 52.0 1627.0 280.0 565.0
                    5 52.0 919.0 213.0 413.0
                    6 52.0 2535.0 489.0 1094.0
                    7 52.0 3104.0 687.0 1157.0
                    8 42.0 2555.0 665.0 1206.0
                    9 52.0 3549.0 707.0 1551.0
                    [23]:
                    import numpy as np
                    [24]:
selected_features["housing_median_age"] = selected_features["housing_median_age"].replace("?", np.nan)
                    
                    [25]:
                    selected_features["housing_median_age"].isnull().sum()
                    [25]:
                    0
                    [26]:
selected_features["total_rooms"] = selected_features["total_rooms"].replace("?", np.nan)
                    
                    [27]:
                    selected_features["total_rooms"].isnull().sum()
                    [27]:
                    0
                    [28]:
selected_features["total_bedrooms"] = selected_features["total_bedrooms"].replace("?", np.nan)
                    
                    [29]:
                    selected_features["total_bedrooms"].isnull().sum()
                    [29]:
                    207
                    [30]:
selected_features["population"] = selected_features["population"].replace("?", np.nan)
                    
                    [31]:
                    selected_features["population"].isnull().sum()
                    [31]:
                    0
                    [32]:
                    selected_features_mean = selected_features["total_bedrooms"].mean()
selected_features["total_bedrooms"] = selected_features["total_bedrooms"].replace(np.nan, selected_features_mean)
                    
                    [33]:
                    selected_features["total_bedrooms"].isnull().sum()
                    [33]:
                    0
                    [34]:
                    selected_features.groupby("population")["total_bedrooms"].mean()
                    [34]:
                    population
                    3.0           6.00
                    5.0           3.00
                    6.0           2.00
                    8.0           3.25
                    9.0           7.00
                                ...   
                    15507.0    5290.00
                    16122.0    5471.00
                    16305.0    6210.00
                    28566.0    6445.00
                    35682.0    4819.00
                    Name: total_bedrooms, Length: 3888, dtype: float64
                    [35]:
                    selected_features_norm = preprocessing.normalize(selected_features)
                    [36]:
                    print(selected_features_norm)
                    [[0.04330435 0.92945912 0.13625026 0.34009754]
                     [0.00277219 0.93713188 0.14600195 0.3169536 ]
                     [0.03331069 0.93974594 0.12171215 0.31773278]
                     ...
                     [0.00675685 0.89587898 0.19276899 0.40024407]
                     [0.008808   0.91016017 0.20013737 0.36259607]
                     [0.00504461 0.87807691 0.19421737 0.43730437]]
                    
                    [37]:
                    from sklearn.preprocessing import RobustScaler
                    scaler = preprocessing.RobustScaler()
                    selected_features_scaler = scaler.fit_transform(selected_features)
                    [38]:
                    selected_features_scaler
                    [38]:
                    array([[ 0.63157895, -0.73342156, -0.89241877, -0.89978678],
                           [-0.42105263,  2.92427584,  1.92924188,  1.31663113],
                           [ 1.21052632, -0.38817821, -0.71624549, -0.71428571],
                           ...,
                           [-0.63157895,  0.0746949 ,  0.13574007, -0.16950959],
                           [-0.57894737, -0.15703573, -0.08375451, -0.45309168],
                           [-0.68421053,  0.38700191,  0.51407942,  0.23560768]])
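`RobustScaler` centers each feature on its median and scales by the interquartile range, which keeps outliers from dominating the scale. A numpy sketch of that transform on a single hypothetical feature:

```python
import numpy as np

# One feature with an outlier (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# RobustScaler transform: (x - median) / IQR, computed per feature
median = np.median(x)                # 3.0
q1, q3 = np.percentile(x, [25, 75])  # 2.0, 4.0
x_scaled = (x - median) / (q3 - q1)

print(x_scaled)  # → [-1.  -0.5  0.   0.5 48.5]
```

Note that the outlier barely affects the median and IQR, so the bulk of the data lands in a small, stable range.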
                    [39]:
                    selected_features_norm = preprocessing.normalize(selected_features_scaler)
                    [40]:
                    selected_features_norm
                    [40]:
                    array([[ 0.39606755, -0.45993376, -0.55964202, -0.56426255],
                           [-0.11179811,  0.77645525,  0.51225331,  0.34959258],
                           [ 0.74513076, -0.2389403 , -0.44087975, -0.43967343],
                           ...,
                           [-0.93980159,  0.11114744,  0.20198383, -0.25223353],
                           [-0.76539427, -0.20760825, -0.11072721, -0.59900744],
                           [-0.70657304,  0.39965056,  0.53088143,  0.2433082 ]])
                    [73]:
                    kmeans = KMeans(n_clusters = 4, random_state = 0, n_init='auto')
                    kmeans.fit(selected_features_norm)
                    [73]:
KMeans(n_clusters=4, random_state=0)
                    [74]:
                    from sklearn.decomposition import PCA
                    [75]:
                    pca = PCA(n_components = 2)
                    pca.fit(selected_features_norm)
                    pca_data = pca.transform(selected_features_norm)
                    pca_data = pd.DataFrame(pca_data, columns = ['PC1', "PC2"])
                    pca_data.head()
                    [75]:
                    PC1 PC2
                    0 1.009226 -0.073469
                    1 -0.923991 -0.085173
                    2 0.846904 -0.475885
                    3 0.844834 -0.508459
                    4 0.769151 -0.577098
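PCA projects the centered data onto the directions of greatest variance. A minimal numpy sketch, via SVD, of what `PCA(n_components = 2)` does, run on random toy data:

```python
import numpy as np

# Random toy data standing in for the scaled feature matrix
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))

# Center, then project onto the top two right-singular vectors
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pc = centered @ Vt[:2].T   # the two principal components

print(pc.shape)            # → (100, 2)
# Explained variances come out sorted in decreasing order
print(S[:2] ** 2 / (len(data) - 1))
```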
                    [76]:
                    sns.scatterplot(data = pca_data, x = 'PC1', y = 'PC2')
                    [76]:
                    <Axes: xlabel='PC1', ylabel='PC2'>
                    [77]:
                    sns.scatterplot(data = pca_data, x = 'PC1', y = 'PC2', hue = kmeans.labels_)
                    [77]:
                    <Axes: xlabel='PC1', ylabel='PC2'>
                    [78]:
                    sns.boxplot(x = kmeans.labels_, y = pca_data["PC1"])
                    [78]:
                    <Axes: ylabel='PC1'>
                    [79]:
                    silhouette_score(pca_data, kmeans.labels_, metric='euclidean')
                    [79]:
                    0.5273664489203738
                    [ ]:

                      ## Logistic Regression
- The notebook implements a logistic regression model to classify bank notes as 'authentic' or 'fake'
- We use a data set with the following features:
  - Variance of Wavelet Transformed image (continuous)
  - Skewness of Wavelet Transformed image (continuous)
  - Curtosis of Wavelet Transformed image (continuous)
  - Entropy of image (continuous)
  - Class (integer)
- Total instances: 1372
- Data Source
  - https://archive.ics.uci.edu/ml/datasets/banknote+authentication

                      [ ]:
                      import pandas as pd
                      import numpy as np
                      from sklearn.linear_model import LogisticRegression
                      from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
                      from sklearn import preprocessing
                      [ ]:
                      #path = "Data\BankNotes_Training.csv"
                      path = "BankNotes_Training.csv"
                      df = pd.read_csv(path)
                      [ ]:
                      df.columns
                      [ ]:
                      df.head()
                      [ ]:
                      df.shape
                      [ ]:
                      X = df.iloc[:, :-1]
                      Y = df.iloc[:,4]
                      [ ]:
                      X
                      [ ]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X = scaler.fit_transform(X)
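RobustScaler centres each column on its median and divides by the interquartile range, which makes it less sensitive to outliers than z-score standardisation. A small self-contained check of that formula (the toy column below is illustrative, not the banknote data):

```python
# Sketch: RobustScaler == (x - median) / IQR, verified on a toy column.
import numpy as np
from sklearn.preprocessing import RobustScaler

col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier
scaled = RobustScaler().fit_transform(col)

median = np.median(col)
q1, q3 = np.percentile(col, [25, 75])
manual = (col - median) / (q3 - q1)  # matches the scaler's output
```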
                      [ ]:
                      X
                      # Data Splitting

                      ![image.png](attachment:f6ea630d-fbd6-40f2-b661-e87bb4d93347.png)


                      [ ]:
                      from sklearn.model_selection import train_test_split
                      X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
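For classification data it is often worth passing `stratify` so that the class ratio is preserved in both splits. A sketch on hypothetical toy labels (the notebook's split above is unstratified):

```python
# Sketch: stratify=y keeps the class balance identical in train and test.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(-1, 1)
y = np.array([0] * 32 + [1] * 8)   # 80% / 20% class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
# Both y_tr and y_te keep the original 20% share of class 1.
```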
                      [ ]:
                      print(len(X_train))
                      print(len(Y_train))
                      print(len(X_test))
                      print(len(Y_test))
                      [ ]:
                      model = LogisticRegression()
                      model.fit(X_train,Y_train)

                      print(model.intercept_)
                      print(model.coef_)
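The intercept and coefficients printed above fully determine the model: the predicted probability of class 1 is the logistic sigmoid of the linear score `intercept + X @ coef`. A tiny self-contained check of that relationship (the toy data below is illustrative):

```python
# Sketch: predict_proba is the sigmoid of the linear score, verified
# on a tiny one-feature model.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
m = LogisticRegression().fit(X, y)

z = m.intercept_ + X @ m.coef_.T   # linear score
p = 1.0 / (1.0 + np.exp(-z))       # sigmoid -> P(class 1)
```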
                      [ ]:
                      # Accuracy
                      score = model.score(X_test, Y_test)
                      [ ]:
                      score
## Evaluation Measures


                      ### Confusion Matrix
                      ![image.png](attachment:a0f8339e-534b-4003-8b03-bcda63f1ed6d.png)

                      Image Source : https://www.evidentlyai.com/classification-metrics/confusion-matrix


                      ### Accuracy : $ \frac{TP + TN}{TP + FN + FP + TN}$
                      ### Precision : $ \frac{TP }{TP + FP }$

                      ### Recall : $ \frac{TP }{TP + FN }$

                      ### F1-Score : $ 2 \cdot \frac{Precision \cdot Recall }{Precision + Recall }$
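The four formulas above can be computed directly from the entries of a 2x2 confusion matrix, which sklearn lays out as `[[TN, FP], [FN, TP]]` for labels 0/1. A sketch on hypothetical toy labels, cross-checked against sklearn's metric functions:

```python
# Sketch: the metric formulas computed by hand from the confusion matrix.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat  = [0, 1, 0, 1, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy  = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
```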


                      [ ]:
                      Y_pred = model.predict(X_test)

                      accuracy = accuracy_score(Y_test, Y_pred)
                      recall = recall_score(Y_test, Y_pred)
                      precision = precision_score(Y_test, Y_pred)
                      f1 = f1_score(Y_test, Y_pred)

                      print(f"Accuracy: {accuracy:0.3f}")
                      print(f"Recall: {recall:.3f}")
                      print(f"Precision: {precision:.3f}")
                      print(f"F1-score: {f1:.3f}")
                      ## Visualizing Confusion Matrix


                      [ ]:
                      import matplotlib.pyplot as plt
                      from sklearn.metrics import confusion_matrix
                      import seaborn as sns

                      # Create confusion matrix
                      cm = confusion_matrix(Y_test, Y_pred)

# Use seaborn to visualize the confusion matrix
                      sns.heatmap(cm, annot=True, fmt="d", cmap="RdBu") # cmap = "YlOrBr"
                      plt.xlabel("Predicted Label")
                      plt.ylabel("True Label")
                      plt.title("Confusion Matrix")
                      plt.show()
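sklearn also ships a built-in plotter, `ConfusionMatrixDisplay`, which draws the same heatmap without seaborn. A sketch on hypothetical toy labels (the `Agg` backend is used here only so the sketch runs headless):

```python
# Sketch: the same plot via sklearn's ConfusionMatrixDisplay.
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_true = [0, 1, 1, 0, 1, 0]
y_hat  = [0, 1, 0, 0, 1, 1]
cm = confusion_matrix(y_true, y_hat)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot(cmap="RdBu")
plt.title("Confusion Matrix")
```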

                      [ ]:
                      # Visualize as a table
                      import pandas as pd
                      df_cm = pd.DataFrame(cm, index=[0, 1], columns=[0, 1])
                      df_cm.index.name = "True Label"
                      df_cm.columns.name = "Predicted Label"
                      print(df_cm)
                      [ ]:

                        [1]:
                        import pandas as pd
                        import numpy as np
                        import math
                        [2]:
                        path = "C:/Users/SINDH/Downloads/HousingPrices.csv"
                        df = pd.read_csv(path) #read_csv() assumes data has a header
                        [3]:
                        df.head(10)
                        [3]:
                        Area Bedrooms Bathrooms Stories Mainroad Guestroom Basement Hotwaterheating Airconditioning Parking Furnishingstatus Price
                        0 7420 4 2 3 yes no no no yes 2 furnished 13300000
                        1 8960 ? 4 4 yes no no no yes 3 furnished 12250000
                        2 ? 3 2 2 yes no yes no no 2 semi-furnished 12250000
                        3 7500 4 2 2 yes no yes no yes 3 furnished 12215000
                        4 7420 4 1 2 yes yes yes no yes 2 furnished 11410000
                        5 7500 3 3 1 yes no yes no yes 2 semi-furnished 10850000
                        6 8580 4 3 4 yes no no no yes 2 semi-furnished 10150000
                        7 16200 5 3 ? yes no no no no 0 unfurnished 10150000
                        8 8100 4 1 2 yes yes yes no yes 2 furnished 9870000
                        9 5750 3 2 4 yes yes no no yes 1 unfurnished 9800000
                        [4]:
                        df.drop(["Guestroom", "Basement", "Hotwaterheating", "Airconditioning"], axis=1, inplace=True)
                        [5]:
                        df.head(10)
                        [5]:
                        Area Bedrooms Bathrooms Stories Mainroad Parking Furnishingstatus Price
                        0 7420 4 2 3 yes 2 furnished 13300000
                        1 8960 ? 4 4 yes 3 furnished 12250000
                        2 ? 3 2 2 yes 2 semi-furnished 12250000
                        3 7500 4 2 2 yes 3 furnished 12215000
                        4 7420 4 1 2 yes 2 furnished 11410000
                        5 7500 3 3 1 yes 2 semi-furnished 10850000
                        6 8580 4 3 4 yes 2 semi-furnished 10150000
                        7 16200 5 3 ? yes 0 unfurnished 10150000
                        8 8100 4 1 2 yes 2 furnished 9870000
                        9 5750 3 2 4 yes 1 unfurnished 9800000
                        [6]:
                        df["Area"].isnull().sum()
                        [6]:
                        0
                        [7]:
                        df.isna().sum()
                        [7]:
                        Area                0
                        Bedrooms            0
                        Bathrooms           0
                        Stories             0
                        Mainroad            0
                        Parking             0
                        Furnishingstatus    0
                        Price               0
                        dtype: int64
                        [8]:
                        df.head(10)
                        [8]:
                        Area Bedrooms Bathrooms Stories Mainroad Parking Furnishingstatus Price
                        0 7420 4 2 3 yes 2 furnished 13300000
                        1 8960 ? 4 4 yes 3 furnished 12250000
                        2 ? 3 2 2 yes 2 semi-furnished 12250000
                        3 7500 4 2 2 yes 3 furnished 12215000
                        4 7420 4 1 2 yes 2 furnished 11410000
                        5 7500 3 3 1 yes 2 semi-furnished 10850000
                        6 8580 4 3 4 yes 2 semi-furnished 10150000
                        7 16200 5 3 ? yes 0 unfurnished 10150000
                        8 8100 4 1 2 yes 2 furnished 9870000
                        9 5750 3 2 4 yes 1 unfurnished 9800000
                        [9]:
df["Area"] = df["Area"].replace("?", np.nan)
df["Area"] = pd.to_numeric(df["Area"])
                        [10]:
                        df["Area"].isnull().sum()
                        [10]:
                        9
                        [11]:
df["Bedrooms"] = df["Bedrooms"].replace("?", np.nan)
df["Bedrooms"] = pd.to_numeric(df["Bedrooms"])
df["Bedrooms"].dtypes
                        [11]:
                        dtype('float64')
                        [12]:
                        df["Bedrooms"].isnull().sum()
                        [12]:
                        7
                        [13]:
                        #df.dropna(subset=['Bedrooms'], inplace=True)
                        [14]:
df["Bathrooms"] = df["Bathrooms"].replace("?", np.nan)
df["Bathrooms"] = pd.to_numeric(df["Bathrooms"])
                        [15]:
                        df["Bathrooms"].isnull().sum()
                        [15]:
                        5
                        [16]:
                        #df.dropna(subset=['Bathrooms'], inplace=True)
                        [17]:
df["Stories"] = df["Stories"].replace("?", np.nan)
df["Stories"] = pd.to_numeric(df["Stories"])
                        [18]:
                        df["Stories"].isnull().sum()
                        [18]:
                        7
                        [19]:
                        #df.dropna(subset=['Stories'], inplace=True)
                        [20]:
df["Mainroad"] = df["Mainroad"].replace("?", np.nan)
                        [21]:
                        df["Mainroad"].isnull().sum()
                        [21]:
                        0
                        [22]:
df["Parking"] = df["Parking"].replace("?", np.nan)
df["Parking"] = pd.to_numeric(df["Parking"])
                        [23]:
                        df["Parking"].isnull().sum()
                        [23]:
                        6
                        [24]:
                        #df.dropna(subset=['Parking'], inplace=True)
                        [25]:
df["Furnishingstatus"] = df["Furnishingstatus"].replace("?", np.nan)
                        [26]:
                        df["Furnishingstatus"].isnull().sum()
                        [26]:
                        0
                        [27]:
df["Price"] = df["Price"].replace("?", np.nan)
df["Price"] = pd.to_numeric(df["Price"])
                        [28]:
                        df["Price"].isnull().sum()
                        [28]:
                        3
                        [29]:
                        df.dropna(subset=['Price'], inplace=True)
                        df["Price"]
                        [29]:
                        0      13300000.0
                        1      12250000.0
                        2      12250000.0
                        3      12215000.0
                        4      11410000.0
                                  ...    
                        540     1820000.0
                        541     1767150.0
                        542     1750000.0
                        543     1750000.0
                        544     1750000.0
                        Name: Price, Length: 542, dtype: float64
                        [30]:
                        mean_area = df["Area"].mean()
                        [31]:
                        df.describe()
                        [31]:
                        Area Bedrooms Bathrooms Stories Parking Price
                        count 533.000000 535.000000 537.000000 535.000000 536.000000 5.420000e+02
                        mean 5155.748593 2.971963 1.286778 1.809346 0.690299 4.767167e+06
                        std 2167.206723 0.737669 0.503414 0.872546 0.860981 1.875564e+06
                        min 1650.000000 1.000000 1.000000 1.000000 0.000000 1.750000e+06
                        25% 3600.000000 3.000000 1.000000 1.000000 0.000000 3.430000e+06
                        50% 4600.000000 3.000000 1.000000 2.000000 0.000000 4.340000e+06
                        75% 6360.000000 3.000000 2.000000 2.000000 1.000000 5.766250e+06
                        max 16200.000000 6.000000 4.000000 4.000000 3.000000 1.330000e+07
                        [32]:
                        df.dtypes
                        df["Area"]
                        [32]:
                        0      7420.0
                        1      8960.0
                        2         NaN
                        3      7500.0
                        4      7420.0
                                ...  
                        540    3000.0
                        541    2400.0
                        542    3620.0
                        543    2910.0
                        544    3850.0
                        Name: Area, Length: 542, dtype: float64
                        [33]:
df["Area"] = df["Area"].replace(np.nan, mean_area)
                        df.head(80)
                        [33]:
                        Area Bedrooms Bathrooms Stories Mainroad Parking Furnishingstatus Price
                        0 7420.000000 4.0 2.0 3.0 yes 2.0 furnished 13300000.0
                        1 8960.000000 NaN 4.0 4.0 yes 3.0 furnished 12250000.0
                        2 5155.748593 3.0 2.0 2.0 yes 2.0 semi-furnished 12250000.0
                        3 7500.000000 4.0 2.0 2.0 yes 3.0 furnished 12215000.0
                        4 7420.000000 4.0 1.0 2.0 yes 2.0 furnished 11410000.0
                        ... ... ... ... ... ... ... ... ...
                        75 4260.000000 4.0 2.0 2.0 yes 0.0 semi-furnished 6650000.0
                        76 5155.748593 3.0 2.0 3.0 yes 0.0 furnished 6650000.0
                        77 6500.000000 3.0 2.0 3.0 yes 0.0 furnished 6650000.0
                        78 5700.000000 3.0 1.0 1.0 yes 2.0 furnished 6650000.0
                        79 6000.000000 3.0 2.0 3.0 yes 0.0 furnished 6650000.0

                        80 rows × 8 columns

                        [34]:
                        df[["Area","Bedrooms"]]
                        [34]:
                        Area Bedrooms
                        0 7420.000000 4.0
                        1 8960.000000 NaN
                        2 5155.748593 3.0
                        3 7500.000000 4.0
                        4 7420.000000 4.0
                        ... ... ...
                        540 3000.000000 2.0
                        541 2400.000000 3.0
                        542 3620.000000 2.0
                        543 2910.000000 3.0
                        544 3850.000000 3.0

                        542 rows × 2 columns

                        [35]:
                        df.head(40)
                        [35]:
                        Area Bedrooms Bathrooms Stories Mainroad Parking Furnishingstatus Price
                        0 7420.000000 4.0 2.0 3.0 yes 2.0 furnished 13300000.0
                        1 8960.000000 NaN 4.0 4.0 yes 3.0 furnished 12250000.0
                        2 5155.748593 3.0 2.0 2.0 yes 2.0 semi-furnished 12250000.0
                        3 7500.000000 4.0 2.0 2.0 yes 3.0 furnished 12215000.0
                        4 7420.000000 4.0 1.0 2.0 yes 2.0 furnished 11410000.0
                        5 7500.000000 3.0 3.0 1.0 yes 2.0 semi-furnished 10850000.0
                        6 8580.000000 4.0 3.0 4.0 yes 2.0 semi-furnished 10150000.0
                        7 16200.000000 5.0 3.0 NaN yes 0.0 unfurnished 10150000.0
                        8 8100.000000 4.0 1.0 2.0 yes 2.0 furnished 9870000.0
                        9 5750.000000 3.0 2.0 4.0 yes 1.0 unfurnished 9800000.0
                        10 13200.000000 3.0 1.0 2.0 yes 2.0 furnished 9800000.0
                        11 6000.000000 4.0 3.0 2.0 yes 2.0 semi-furnished 9681000.0
                        12 6550.000000 4.0 2.0 2.0 yes 1.0 semi-furnished 9310000.0
                        13 3500.000000 4.0 2.0 2.0 yes 2.0 furnished 9240000.0
                        14 7800.000000 3.0 2.0 2.0 yes 0.0 semi-furnished 9240000.0
                        15 6000.000000 4.0 1.0 2.0 yes 2.0 semi-furnished 9100000.0
                        16 6600.000000 4.0 2.0 2.0 yes 1.0 unfurnished 9100000.0
                        17 8500.000000 3.0 2.0 4.0 yes 2.0 furnished 8960000.0
                        18 5155.748593 3.0 2.0 2.0 yes 2.0 furnished 8890000.0
                        19 6420.000000 3.0 2.0 2.0 yes 1.0 semi-furnished 8855000.0
                        20 4320.000000 3.0 1.0 2.0 yes 2.0 semi-furnished 8750000.0
                        21 7155.000000 3.0 2.0 1.0 yes 2.0 unfurnished 8680000.0
                        22 8050.000000 3.0 1.0 1.0 yes 1.0 furnished 8645000.0
                        23 4560.000000 3.0 2.0 NaN yes 1.0 furnished 8645000.0
                        24 8800.000000 3.0 2.0 2.0 yes 2.0 furnished 8575000.0
                        25 6540.000000 4.0 2.0 2.0 yes 2.0 furnished 8540000.0
                        26 6000.000000 3.0 2.0 4.0 yes 0.0 semi-furnished 8463000.0
                        27 8875.000000 NaN 1.0 1.0 yes 1.0 semi-furnished 8400000.0
                        28 7950.000000 5.0 2.0 2.0 yes 2.0 unfurnished 8400000.0
                        29 5500.000000 4.0 2.0 2.0 yes 1.0 semi-furnished 8400000.0
                        30 7475.000000 3.0 2.0 4.0 yes 2.0 unfurnished 8400000.0
                        31 7000.000000 3.0 1.0 4.0 yes 2.0 semi-furnished 8400000.0
                        32 4880.000000 4.0 2.0 2.0 yes NaN furnished 8295000.0
                        33 5960.000000 3.0 3.0 2.0 yes 1.0 unfurnished 8190000.0
                        34 6840.000000 5.0 1.0 2.0 yes 1.0 furnished 8120000.0
                        35 7000.000000 3.0 2.0 4.0 yes 2.0 furnished 8080940.0
                        36 7482.000000 3.0 2.0 3.0 yes 1.0 furnished 8043000.0
                        37 9000.000000 4.0 2.0 4.0 yes 2.0 furnished 7980000.0
                        38 6000.000000 3.0 1.0 4.0 yes 2.0 unfurnished 7962500.0
                        39 6000.000000 4.0 NaN 4.0 yes 1.0 semi-furnished 7910000.0
                        [36]:
                        mean_bedrooms = df["Bedrooms"].mean()
                        [37]:
                        mean_bedrooms = math.ceil(mean_bedrooms)
                        [38]:
                        mean_bedrooms
                        [38]:
                        3
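Taking `ceil(mean)` works here because Bedrooms is a small integer count; the mode is a common alternative for discrete columns. A sketch on a hypothetical toy series (not the housing data):

```python
# Sketch: mode-based imputation as an alternative to ceil(mean)
# for a discrete count column.
import numpy as np
import pandas as pd

s = pd.Series([3, 3, 4, 2, np.nan, 3, np.nan])
filled = s.fillna(s.mode().iloc[0])  # mode of [3, 3, 4, 2, 3] is 3
```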
                        [39]:
df["Bedrooms"] = df["Bedrooms"].replace(np.nan, mean_bedrooms)
                        [40]:
                        df[["Bedrooms", "Area"]]
                        [40]:
                        Bedrooms Area
                        0 4.0 7420.000000
                        1 3.0 8960.000000
                        2 3.0 5155.748593
                        3 4.0 7500.000000
                        4 4.0 7420.000000
                        ... ... ...
                        540 2.0 3000.000000
                        541 3.0 2400.000000
                        542 2.0 3620.000000
                        543 3.0 2910.000000
                        544 3.0 3850.000000

                        542 rows × 2 columns

                        [41]:
                        mean_bathrooms = df["Bathrooms"].mean()
                        mean_bathrooms = math.ceil(mean_bathrooms)
                        mean_bathrooms
                        [41]:
                        2
                        [42]:
df["Bathrooms"] = df["Bathrooms"].replace(np.nan, mean_bathrooms)
                        [43]:
                        mean_stories = df["Stories"].mean()
                        mean_stories = math.ceil(mean_stories)
                        mean_stories
                        [43]:
                        2
                        [44]:
df["Stories"] = df["Stories"].replace(np.nan, mean_stories)
                        [45]:
                        mean_parking = df["Parking"].mean()
                        mean_parking = math.ceil(mean_parking)
                        mean_parking
                        [45]:
                        1
                        [46]:
df["Parking"] = df["Parking"].replace(np.nan, mean_parking)
                        [47]:
                        print(df["Mainroad"].unique())
                        ['yes' 'no']
                        
                        [48]:
                        print(df["Furnishingstatus"].unique())
                        ['furnished' 'semi-furnished' 'unfurnished']
                        
                        [49]:
                        df["Mainroad"].value_counts()
                        [49]:
                        Mainroad
                        yes    465
                        no      77
                        Name: count, dtype: int64
                        [50]:
                        df["Furnishingstatus"].value_counts()
                        [50]:
                        Furnishingstatus
                        semi-furnished    226
                        unfurnished       178
                        furnished         138
                        Name: count, dtype: int64
                        [51]:
                        categorical_data = pd.get_dummies(df, columns = ['Mainroad', 'Furnishingstatus'])
                        [52]:
                        print(categorical_data)
                                    Area  Bedrooms  Bathrooms  Stories  Parking       Price  \
                        0    7420.000000       4.0        2.0      3.0      2.0  13300000.0   
                        1    8960.000000       3.0        4.0      4.0      3.0  12250000.0   
                        2    5155.748593       3.0        2.0      2.0      2.0  12250000.0   
                        3    7500.000000       4.0        2.0      2.0      3.0  12215000.0   
                        4    7420.000000       4.0        1.0      2.0      2.0  11410000.0   
                        ..           ...       ...        ...      ...      ...         ...   
                        540  3000.000000       2.0        1.0      1.0      2.0   1820000.0   
                        541  2400.000000       3.0        1.0      1.0      0.0   1767150.0   
                        542  3620.000000       2.0        1.0      1.0      0.0   1750000.0   
                        543  2910.000000       3.0        1.0      1.0      0.0   1750000.0   
                        544  3850.000000       3.0        1.0      2.0      0.0   1750000.0   
                        
                             Mainroad_no  Mainroad_yes  Furnishingstatus_furnished  \
                        0          False          True                        True   
                        1          False          True                        True   
                        2          False          True                       False   
                        3          False          True                        True   
                        4          False          True                        True   
                        ..           ...           ...                         ...   
                        540        False          True                       False   
                        541         True         False                       False   
                        542        False          True                       False   
                        543         True         False                        True   
                        544        False          True                       False   
                        
                             Furnishingstatus_semi-furnished  Furnishingstatus_unfurnished  
                        0                              False                         False  
                        1                              False                         False  
                        2                               True                         False  
                        3                              False                         False  
                        4                              False                         False  
                        ..                               ...                           ...  
                        540                            False                          True  
                        541                             True                         False  
                        542                            False                          True  
                        543                            False                         False  
                        544                            False                          True  
                        
                        [542 rows x 11 columns]
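Note that `get_dummies` emits one column per category, so `Mainroad_no` and `Mainroad_yes` above are perfectly collinear (one is always the negation of the other); passing `drop_first=True` keeps a single dummy per category. A small sketch on a made-up mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame mirroring the Mainroad column
mini = pd.DataFrame({"Mainroad": ["yes", "no", "yes"]})

# drop_first=True drops the first category ("no"), leaving one dummy
dummies = pd.get_dummies(mini, columns=["Mainroad"], drop_first=True)
print(list(dummies.columns))  # -> ['Mainroad_yes']
```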
                        
                        [53]:
                        import matplotlib.pyplot as plt
                        import seaborn as sb
                        [63]:
                        plt.scatter(df["Area"], df["Price"])  # scatter, not plot: rows are unordered, so a line plot zig-zags
                        plt.xlabel("Area")
                        plt.ylabel("Price")
                        plt.title("Price vs Area Plot")
                        plt.show()
                        [62]:
                        plt.scatter(df["Bedrooms"], df["Price"])
                        plt.xlabel("Bedrooms")
                        plt.ylabel("Price")
                        plt.title("Price vs Bedrooms Plot")
                        plt.show()
                        [61]:
                        plt.scatter(df["Bedrooms"], df["Area"])
                        plt.xlabel("Bedrooms")
                        plt.ylabel("Area")
                        plt.title("Area vs Bedrooms Plot")
                        plt.show()
                        [60]:
                        plt.scatter(df["Bedrooms"], df["Stories"])
                        plt.xlabel("Bedrooms")
                        plt.ylabel("Stories")
                        plt.title("Stories vs Bedrooms Plot")
                        plt.show()
                        [59]:
                        plt.scatter(df["Area"], df["Stories"])
                        plt.xlabel("Area")
                        plt.ylabel("Stories")
                        plt.title("Stories vs Area Plot")
                        plt.show()
                        [64]:
                        categorical_data
                        [64]:
                        Area Bedrooms Bathrooms Stories Parking Price Mainroad_no Mainroad_yes Furnishingstatus_furnished Furnishingstatus_semi-furnished Furnishingstatus_unfurnished
                        0 7420.000000 4.0 2.0 3.0 2.0 13300000.0 False True True False False
                        1 8960.000000 3.0 4.0 4.0 3.0 12250000.0 False True True False False
                        2 5155.748593 3.0 2.0 2.0 2.0 12250000.0 False True False True False
                        3 7500.000000 4.0 2.0 2.0 3.0 12215000.0 False True True False False
                        4 7420.000000 4.0 1.0 2.0 2.0 11410000.0 False True True False False
                        ... ... ... ... ... ... ... ... ... ... ... ...
                        540 3000.000000 2.0 1.0 1.0 2.0 1820000.0 False True False False True
                        541 2400.000000 3.0 1.0 1.0 0.0 1767150.0 True False False True False
                        542 3620.000000 2.0 1.0 1.0 0.0 1750000.0 False True False False True
                        543 2910.000000 3.0 1.0 1.0 0.0 1750000.0 True False True False False
                        544 3850.000000 3.0 1.0 2.0 0.0 1750000.0 False True False False True

                        542 rows × 11 columns

                        [66]:
                        categorical_data["Furnishingstatus_furnished"]
                        [66]:
                        0       True
                        1       True
                        2      False
                        3       True
                        4       True
                               ...  
                        540    False
                        541    False
                        542    False
                        543     True
                        544    False
                        Name: Furnishingstatus_furnished, Length: 542, dtype: bool
                        [71]:
                        # Import Linear Model from scikit-learn
                        from sklearn.linear_model import LinearRegression
                        [72]:
                        # Create a Linear Regression Object
                        linear_model = LinearRegression()
                        [69]:
                        # Check for NaN or missing values for other attributes
                        print(categorical_data["Area"].isnull().sum())
                        print(categorical_data["Bedrooms"].isnull().sum())
                        print(categorical_data["Bathrooms"].isnull().sum())
                        print(categorical_data["Stories"].isnull().sum())
                        print(categorical_data["Parking"].isnull().sum())
                        print(categorical_data["Mainroad_no"].isnull().sum())
                        print(categorical_data["Mainroad_yes"].isnull().sum())
                        print(categorical_data["Furnishingstatus_furnished"].isnull().sum())
                        print(categorical_data["Furnishingstatus_semi-furnished"].isnull().sum())
                        print(categorical_data["Furnishingstatus_unfurnished"].isnull().sum())
                        print(categorical_data["Price"].isnull().sum())
                        0
                        0
                        0
                        0
                        0
                        0
                        0
                        0
                        0
                        0
                        0
                        
                        [103]:
                        X = categorical_data[["Area", "Bedrooms", "Bathrooms", "Stories", "Parking", "Mainroad_no", "Mainroad_yes", "Furnishingstatus_furnished", "Furnishingstatus_semi-furnished", "Furnishingstatus_unfurnished"]]
                        Y = categorical_data["Price"]
                        linear_model.fit(X, Y)
                        [103]:
                        LinearRegression()
                        [104]:
                        print("c_0 = ", linear_model.intercept_)
                        print("c_{1-10} = ", linear_model.coef_)
                        c_0 =  -55069.20985339861
                        c_{1-10} =  [ 2.76303444e+02  2.02463933e+05  1.12200914e+06  4.96912078e+05
                          3.47723155e+05 -3.03209898e+05  3.03209898e+05  2.70695767e+05
                          6.91308081e+04 -3.39826575e+05]
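The raw coefficient array is easier to read when paired with the feature names via `zip(X.columns, linear_model.coef_)`. A sketch on toy data (the housing frame isn't reproduced here; the toy relationship `Price = 100*Area + 5*Bedrooms` is invented for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data with a known exact relationship: Price = 100*Area + 5*Bedrooms
X_toy = pd.DataFrame({"Area": [1000, 2000, 3000, 4000],
                      "Bedrooms": [2, 3, 3, 4]})
y_toy = 100 * X_toy["Area"] + 5 * X_toy["Bedrooms"]

model = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with its feature name for readability
for name, coef in zip(X_toy.columns, model.coef_):
    print(f"{name}: {coef:.2f}")
```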
                        
                        [113]:
                        linear_model.predict(pd.DataFrame([[7420, 4, 2, 3, 2, False, True, True, False, False]], columns=X.columns))

                        [113]:
                        array([7809064.56773169])
                        [107]:
                        from sklearn.metrics import mean_squared_error
                        [108]:
                        # Mean Squared Error (MSE)
                        Y_hat = linear_model.predict(X)
                        mean_squared_error(categorical_data["Price"], Y_hat)
                        [108]:
                        1472162563248.0195
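An MSE on prices in the millions is hard to interpret on its own; RMSE puts the error back in price units, and R² gives the fraction of variance explained. A sketch of the same pattern on toy data (the noisy linear dataset below is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Toy regression: y = 3x plus a little Gaussian noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 1))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 0.5, size=50)

model = LinearRegression().fit(X_toy, y_toy)
y_hat = model.predict(X_toy)

rmse = mean_squared_error(y_toy, y_hat) ** 0.5  # root of MSE, in y's units
r2 = r2_score(y_toy, y_hat)                     # fraction of variance explained
print(f"RMSE={rmse:.3f}, R^2={r2:.3f}")
```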
                        [120]:
                        sb.regplot(x = "Area", y = "Price", data = categorical_data)
                        plt.ylim(0,)
                        [120]:
                        (0.0, 13877500.0)
                        [115]:
                        sb.residplot(x= "Bedrooms", y = "Price", data = categorical_data)
                        [115]:
                        <Axes: xlabel='Bedrooms', ylabel='Price'>
                        [111]:
                        sb.residplot(x= "Area", y = "Price", data = categorical_data)
                        [111]:
                        <Axes: xlabel='Area', ylabel='Price'>
                        [ ]:

                          [2]:
                          import pandas as pd
                          import numpy as np
                          [3]:
                          path = "C:/Users/SINDH/Downloads/08_DataPreProcessing_II/08_DataPreProcessing_II/04_Data.csv"
                          df = pd.read_csv(path) #read_csv() assumes data has a header
                          [4]:
                          df.head()
                          [4]:
                            Patient ID  Age        BMI     Diagnosis Blood Pressure
                          0      P0001   64  23.416480           NaN         119/66
                          1      P0002   49  30.539825           NaN         103/62
                          2      P0003   68  31.654859  Hypertension          98/70
                          3      P0004   22        NaN  Hypertension         117/87
                          4      P0005   42        NaN  Hypertension            NaN
                          [5]:
                          df.dtypes
                          [5]:
                          Patient ID         object
                          Age                 int64
                          BMI               float64
                          Diagnosis          object
                          Blood Pressure     object
                          dtype: object
                          [6]:
                          df.isna()
                          [6]:
                          Patient ID Age BMI Diagnosis Blood Pressure
                          0 False False False True False
                          1 False False False True False
                          2 False False False False False
                          3 False False True False False
                          4 False False True False True
                          5 False False True False False
                          6 False False False False False
                          7 False False False True False
                          8 False False False True False
                          9 False False False True True
                          10 False False True False False
                          11 False False True False False
                          12 False False True False True
                          13 False False False False False
                          14 False False True False False
                          15 False False True True True
                          16 False False True False False
                          17 False False False False True
                          18 False False False False False
                          19 False False False False True
                          20 False False False False False
                          21 False False True False True
                          22 False False False True False
                          23 False False True True False
                          24 False False True False False
                          25 False False True False False
                          26 False False False False False
                          27 False False False True True
                          28 False False False False True
                          29 False False True False False
                          [7]:
                          df.isna().sum()
                          [7]:
                          Patient ID         0
                          Age                0
                          BMI               14
                          Diagnosis          9
                          Blood Pressure     9
                          dtype: int64
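With mixed dtypes, the fill strategy differs per column: mean (or median) for numeric columns, mode for categorical ones. A minimal sketch on a toy frame shaped like this one — the values are invented:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the BMI / Diagnosis columns (illustrative values)
toy = pd.DataFrame({
    "BMI": [23.4, np.nan, 31.7, np.nan],
    "Diagnosis": ["Diabetes", np.nan, "Hypertension", "Diabetes"],
})

toy["BMI"] = toy["BMI"].fillna(toy["BMI"].mean())                       # numeric -> mean
toy["Diagnosis"] = toy["Diagnosis"].fillna(toy["Diagnosis"].mode()[0])  # categorical -> mode
print(int(toy.isna().sum().sum()))  # -> 0
```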
                          [8]:
                          mean = df["BMI"].mean()
                          [9]:
                          df["BMI"] = df["BMI"].replace(np.nan, mean)
                          #newdf = df.copy()
                          [10]:
                          df
                          [10]:
                          Patient ID Age BMI Diagnosis Blood Pressure
                          0 P0001 64 23.416480 NaN 119/66
                          1 P0002 49 30.539825 NaN 103/62
                          2 P0003 68 31.654859 Hypertension 98/70
                          3 P0004 22 27.203652 Hypertension 117/87
                          4 P0005 42 27.203652 Hypertension NaN
                          5 P0006 29 27.203652 Hypertension 139/81
                          6 P0007 21 20.154990 Diabetes 115/60
                          7 P0008 51 21.675982 NaN 137/65
                          8 P0009 42 31.285261 NaN 123/71
                          9 P0010 35 33.454012 NaN NaN
                          10 P0011 62 27.203652 Diabetes 111/84
                          11 P0012 23 27.203652 Hypertension 117/75
                          12 P0013 46 27.203652 Hypertension NaN
                          13 P0014 40 31.369513 Hypertension 102/69
                          14 P0015 27 27.203652 Diabetes 92/75
                          15 P0016 21 27.203652 NaN NaN
                          16 P0017 69 27.203652 Hypertension 131/82
                          17 P0018 43 26.752736 Hypertension NaN
                          18 P0019 23 27.885888 Hypertension 110/81
                          19 P0020 38 29.278378 Hypertension NaN
                          20 P0021 63 23.506374 Diabetes 100/89
                          21 P0022 56 27.203652 Diabetes NaN
                          22 P0023 43 29.850859 NaN 112/82
                          23 P0024 38 27.203652 NaN 92/60
                          24 P0025 35 27.203652 Hypertension 90/69
                          25 P0026 51 27.203652 Diabetes 113/70
                          26 P0027 65 23.628654 Diabetes 107/74
                          27 P0028 46 22.254053 NaN NaN
                          28 P0029 39 28.550565 Hypertension NaN
                          29 P0030 60 27.203652 Diabetes 137/75
                          [11]:
                          mean_bp = df["Blood Pressure"].mean()
                          ---------------------------------------------------------------------------
                          TypeError                                 Traceback (most recent call last)
                          Cell In[11], line 1
                          ----> 1 mean_bp = df["Blood Pressure"].mean()

                          TypeError: can only concatenate str (not "int") to str
                          [12]:
                          df["Blood Pressure"].dtypes
                          [12]:
                          dtype('O')
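The `.mean()` call fails because the column holds "systolic/diastolic" strings. Rather than discarding them, the strings can be split into two numeric columns; a sketch on toy values (the column names `Systolic` and `Diastolic` are illustrative, not from the assignment):

```python
import numpy as np
import pandas as pd

# Toy "sys/dia" strings with one missing reading
toy = pd.DataFrame({"Blood Pressure": ["119/66", "103/62", np.nan]})

# Split each string at "/" into two numeric columns; NaN rows stay NaN
bp = toy["Blood Pressure"].str.split("/", expand=True).astype(float)
toy["Systolic"] = bp[0]
toy["Diastolic"] = bp[1]
print(toy["Systolic"].tolist())  # -> [119.0, 103.0, nan]
```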
                          [18]:
                          # Estimate blood pressure from Age (2*Age + 65); note this overwrites the
                          # original "systolic/diastolic" strings, which .mean() could not handle
                          df["Blood Pressure"] = 2 * df["Age"] + 65
                          [19]:
                          df.head()
                          [19]:
                            Patient ID  Age        BMI     Diagnosis  Blood Pressure
                          0      P0001   64  23.416480           NaN             193
                          1      P0002   49  30.539825           NaN             163
                          2      P0003   68  31.654859  Hypertension             201
                          3      P0004   22  27.203652  Hypertension             109
                          4      P0005   42  27.203652  Hypertension             149
                          [20]:
                          df.tail()
                          [20]:
                          Patient ID Age BMI Diagnosis Blood Pressure
                          25 P0026 51 27.203652 Diabetes 167
                          26 P0027 65 23.628654 Diabetes 195
                          27 P0028 46 22.254053 NaN 157
                          28 P0029 39 28.550565 Hypertension 143
                          29 P0030 60 27.203652 Diabetes 185
                          [21]:
                          df["Diagnosis"].dtypes
                          [21]:
                          dtype('O')
                          [22]:
                          df["BMI"]
                          [22]:
                          0     23.416480
                          1     30.539825
                          2     31.654859
                          3     27.203652
                          4     27.203652
                          5     27.203652
                          6     20.154990
                          7     21.675982
                          8     31.285261
                          9     33.454012
                          10    27.203652
                          11    27.203652
                          12    27.203652
                          13    31.369513
                          14    27.203652
                          15    27.203652
                          16    27.203652
                          17    26.752736
                          18    27.885888
                          19    29.278378
                          20    23.506374
                          21    27.203652
                          22    29.850859
                          23    27.203652
                          24    27.203652
                          25    27.203652
                          26    23.628654
                          27    22.254053
                          28    28.550565
                          29    27.203652
                          Name: BMI, dtype: float64
                          [23]:
                          df["BMI"] = (df["BMI"] - df["BMI"].min()) / (df["BMI"].max() - df["BMI"].min())
                          [24]:
                          df["BMI"]
                          [24]:
                          0     0.245243
                          1     0.780872
                          2     0.864715
                          3     0.530014
                          4     0.530014
                          5     0.530014
                          6     0.000000
                          7     0.114369
                          8     0.836924
                          9     1.000000
                          10    0.530014
                          11    0.530014
                          12    0.530014
                          13    0.843259
                          14    0.530014
                          15    0.530014
                          16    0.530014
                          17    0.496108
                          18    0.581313
                          19    0.686019
                          20    0.252002
                          21    0.530014
                          22    0.729066
                          23    0.530014
                          24    0.530014
                          25    0.530014
                          26    0.261197
                          27    0.157836
                          28    0.631293
                          29    0.530014
                          Name: BMI, dtype: float64
                          [25]:
                          print(df["BMI"].max(), df["BMI"].min())
                          1.0 0.0
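The manual `(x - min) / (max - min)` scaling above is exactly what scikit-learn's `MinMaxScaler` computes; a quick cross-check on toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of values to scale into [0, 1]
vals = np.array([20.0, 25.0, 30.0]).reshape(-1, 1)

manual = (vals - vals.min()) / (vals.max() - vals.min())  # hand-rolled min-max
scaled = MinMaxScaler().fit_transform(vals)               # scikit-learn equivalent

print(np.allclose(manual, scaled))  # -> True
```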
                          
                          [32]:
                          bins = np.linspace(min(df["Age"]), max(df["Age"]), 4)
                          [33]:
                          bins
                          [33]:
                          array([21., 37., 53., 69.])
                          [36]:
                          group_names = ["Adults", "Mature", "Old Home Age"]
                          [37]:
                          df["Age-binned"] = pd.cut(df["Age"], bins, labels = group_names, include_lowest = True)
                          [38]:
                          df["Age-binned"]
                          [38]:
                          0     Old Home Age
                          1           Mature
                          2     Old Home Age
                          3           Adults
                          4           Mature
                          5           Adults
                          6           Adults
                          7           Mature
                          8           Mature
                          9           Adults
                          10    Old Home Age
                          11          Adults
                          12          Mature
                          13          Mature
                          14          Adults
                          15          Adults
                          16    Old Home Age
                          17          Mature
                          18          Adults
                          19          Mature
                          20    Old Home Age
                          21    Old Home Age
                          22          Mature
                          23          Mature
                          24          Adults
                          25          Mature
                          26    Old Home Age
                          27          Mature
                          28          Mature
                          29    Old Home Age
                          Name: Age-binned, dtype: category
                          Categories (3, object): ['Adults' < 'Mature' < 'Old Home Age']
                          [39]:
                          df.head(10)
                          [39]:
                            Patient ID  Age       BMI     Diagnosis  Blood Pressure    Age-binned
                          0      P0001   64  0.245243           NaN             193  Old Home Age
                          1      P0002   49  0.780872           NaN             163        Mature
                          2      P0003   68  0.864715  Hypertension             201  Old Home Age
                          3      P0004   22  0.530014  Hypertension             109        Adults
                          4      P0005   42  0.530014  Hypertension             149        Mature
                          5      P0006   29  0.530014  Hypertension             123        Adults
                          6      P0007   21  0.000000      Diabetes             107        Adults
                          7      P0008   51  0.114369           NaN             167        Mature
                          8      P0009   42  0.836924           NaN             149        Mature
                          9      P0010   35  1.000000           NaN             135        Adults
                          [40]:
                          pd.get_dummies(df["Diagnosis"])
                          [40]:
                          Diabetes Hypertension
                          0 False False
                          1 False False
                          2 False True
                          3 False True
                          4 False True
                          5 False True
                          6 True False
                          7 False False
                          8 False False
                          9 False False
                          10 True False
                          11 False True
                          12 False True
                          13 False True
                          14 True False
                          15 False False
                          16 False True
                          17 False True
                          18 False True
                          19 False True
                          20 True False
                          21 True False
                          22 False False
                          23 False False
                          24 False True
                          25 True False
                          26 True False
                          27 False False
                          28 False True
                          29 True False
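By default `pd.get_dummies` gives rows with a missing `Diagnosis` an all-False indicator row, as seen above. A small sketch of two useful options, `dummy_na` and `dtype`:

```python
import pandas as pd

diagnosis = pd.Series(['Hypertension', None, 'Diabetes', 'Hypertension'], name='Diagnosis')

# Default: a missing diagnosis becomes an all-False row, as above
dummies = pd.get_dummies(diagnosis)
print(dummies)

# dummy_na=True adds an explicit NaN column; dtype=int gives 0/1 instead of booleans
dummies_na = pd.get_dummies(diagnosis, dummy_na=True, dtype=int)
print(dummies_na)
```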

                            # K-Means Clustering


                            ## Problem Statement

- The following features are available for California houses in a specific locality, obtained from the 1990 census data:
  - Longitude
  - Latitude
  - Housing Median Age
  - Total Rooms
  - Total Bedrooms
  - Population
  - Households
  - Median Income
  - Median House Value
  - Ocean Proximity
- Create clusters/groups of houses based on a selected set of features.

                            ## Acknowledgement / Source

- Data
  - https://www.kaggle.com/datasets/camnugent/california-housing-prices/
- Code
  - https://www.datacamp.com/tutorial/k-means-clustering-python

                            ## Importing Libraries

                            [1]:
                            import pandas as pd
                            import seaborn as sns

                            from sklearn.model_selection import train_test_split
                            from sklearn import preprocessing
                            from sklearn.cluster import KMeans
                            from sklearn.metrics import silhouette_score
                            ## Loading the Dataset

                            [2]:
                            home_data = pd.read_csv('Data/CaliforniaHousingPrices.csv')
                            home_data.head()
                            [2]:
                            longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
                            0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
                            1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
                            2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
                            3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
                            4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
                            [3]:
# Select only 3 features for this case study: longitude, latitude, and median house value

                            home_data = home_data[['longitude', 'latitude', 'median_house_value']]

                            home_data.head()
                            [3]:
                            longitude latitude median_house_value
                            0 -122.23 37.88 452600.0
                            1 -122.22 37.86 358500.0
                            2 -122.24 37.85 352100.0
                            3 -122.25 37.85 341300.0
                            4 -122.25 37.85 342200.0
                            [4]:
                            home_data.shape
                            [4]:
                            (20640, 3)
                            ## Visualize the Data

                            [5]:
                            # 'median_house_value' column is used to color-code the data points
                            sns.scatterplot(data = home_data, x = 'longitude', y = 'latitude', hue = 'median_house_value')
                            [5]:
                            <Axes: xlabel='longitude', ylabel='latitude'>
                            ## Pre-Processing

                            [ ]:
                            #from sklearn.model_selection import train_test_split

                            #X_train, X_test, y_train, y_test = train_test_split(home_data[['latitude', 'longitude']], home_data[['median_house_value']], test_size=0.33, random_state=0)
                            [6]:
                            #from sklearn import preprocessing

                            X = home_data[['latitude', 'longitude']]

                            X_norm = preprocessing.normalize(X)
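Note that `preprocessing.normalize` rescales each row (sample) to unit L2 norm; it is not column-wise standardization like `StandardScaler`. A tiny sketch of the behaviour:

```python
import numpy as np
from sklearn import preprocessing

X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0]])

# Each ROW is scaled to unit L2 norm; columns are not standardized
X_demo_norm = preprocessing.normalize(X_demo)
print(X_demo_norm)
```

For the latitude/longitude pairs above, this divides each coordinate pair by its own magnitude, which is a deliberate modeling choice rather than the only option.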
                            ## Model

                            [16]:
                            #from sklearn.cluster import KMeans

                            kmeans = KMeans(n_clusters = 3, random_state = 0, n_init='auto')
                            kmeans.fit(X_norm)
                            [16]:
                            KMeans(n_clusters=3, n_init='auto', random_state=0)
                            [13]:
                            sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = kmeans.labels_)
                            [13]:
                            <Axes: xlabel='longitude', ylabel='latitude'>
                            [17]:
                            house_values = home_data['median_house_value']
                            sns.boxplot(x = kmeans.labels_, y = house_values)
                            [17]:
                            <Axes: ylabel='median_house_value'>
                            ### Silhouette (si·loo·**et**) Score

                            - **Scores closer to 1**: Indicate well-separated clusters, suggesting the clustering is likely effective in capturing the underlying structure in the data.
                            - **Scores around 0**: Indicate clusters with some overlap, and you might consider adjusting the number of clusters or the clustering algorithm to see if you can achieve better separation.
                            - **Negative scores**: Suggest that some data points are potentially assigned to the wrong cluster, and you might need to explore alternative clustering strategies.

                            [15]:
                            #from sklearn.metrics import silhouette_score
                            silhouette_score(X_norm, kmeans.labels_, metric='euclidean')
                            [15]:
                            0.7761558886704949
                            ## Choosing the Number of Clusters

                            [22]:
                            K = range(2, 8)

for k in K:
    print(k)
                            2
                            3
                            4
                            5
                            6
                            7
                            
                            [23]:
                            K = range(2, 8)
                            fits = []
                            score = []


for k in K:
    # train the model for the current value of k
    model = KMeans(n_clusters = k, random_state = 0, n_init='auto').fit(X_norm)
    # append the fitted model to fits
    fits.append(model)
    # append the silhouette score to score
    score.append(silhouette_score(X_norm, model.labels_, metric='euclidean'))
                            [24]:
                            sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[0].labels_)
                            [24]:
                            <Axes: xlabel='longitude', ylabel='latitude'>
                            [25]:
                            sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[2].labels_)
                            [25]:
                            <Axes: xlabel='longitude', ylabel='latitude'>
                            [26]:
                            sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[5].labels_)
                            [26]:
                            <Axes: xlabel='longitude', ylabel='latitude'>
                            [27]:
                            sns.lineplot(x = K, y = score)
                            [27]:
                            <Axes: >
                            [28]:
                            sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = fits[3].labels_)
                            [28]:
                            <Axes: xlabel='longitude', ylabel='latitude'>
                            [29]:
                            sns.boxplot(x = fits[3].labels_, y = house_values)
                            [29]:
                            <Axes: ylabel='median_house_value'>
                            [30]:
                            home_data = pd.read_csv('Data/CaliforniaHousingPrices.csv')
                            home_data.head()
                            [30]:
                            longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
                            0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
                            1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
                            2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
                            3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
                            4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
## Assignment

- Identify which of the following columns have missing values:
  - housing_median_age
  - total_rooms
  - total_bedrooms
  - population
- Handle the missing values
- Normalize the data
- Cluster the data into 4 classes
- Using PCA, reduce the dimensions from 4 to 2
- Visualize the original clusters using the 2 dimensions obtained via PCA

                            [33]:
                            selected_features = home_data[ ["housing_median_age", "total_rooms", "total_bedrooms", "population"] ]

                            selected_features.describe()
                            [33]:
                            housing_median_age total_rooms total_bedrooms population
                            count 20640.000000 20640.000000 20433.000000 20640.000000
                            mean 28.639486 2635.763081 537.870553 1425.476744
                            std 12.585558 2181.615252 421.385070 1132.462122
                            min 1.000000 2.000000 1.000000 3.000000
                            25% 18.000000 1447.750000 296.000000 787.000000
                            50% 29.000000 2127.000000 435.000000 1166.000000
                            75% 37.000000 3148.000000 647.000000 1725.000000
                            max 52.000000 39320.000000 6445.000000 35682.000000
                            [ ]:
# Identify the columns with missing values in selected_features and handle them
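For the first two assignment steps, one common pattern (shown here on a small synthetic stand-in, so the exercise itself is left open) is `isnull().sum()` to find the gaps and mean imputation to fill them; `total_bedrooms` is the column with gaps, as its lower count in the `describe()` output above indicates:

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for selected_features
toy = pd.DataFrame({
    'housing_median_age': [41.0, 21.0, 52.0, 52.0],
    'total_rooms': [880.0, 7099.0, 1467.0, 1274.0],
    'total_bedrooms': [129.0, np.nan, 190.0, np.nan],
    'population': [322.0, 2401.0, 496.0, 558.0],
})

# Step 1: which columns have missing values?
print(toy.isnull().sum())

# Step 2: one common choice is mean imputation
toy['total_bedrooms'] = toy['total_bedrooms'].fillna(toy['total_bedrooms'].mean())
print(toy.isnull().sum().sum())
```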


                            [ ]:
                            # Normalize the values
selected_features_normalized = preprocessing.normalize(selected_features)

                            [ ]:
                            # Cluster the data into 4 classes
                            kmeans = KMeans(n_clusters = 4, random_state = 0, n_init='auto')
                            kmeans.fit(selected_features_normalized)
                            [ ]:
                            # Using PCA to convert selected_features into two dimensions


                            [ ]:
                            # Visualize the 4 clusters using the 2 PCA features



                              [1]:
                              import pandas as pd
                              pd.options.mode.chained_assignment = None # default='warn'
                              import numpy as np
                              [2]:
                              path = "C:/Users/SINDH/Downloads/09_Exploratory_Data_Analysis/09_Exploratory_Data_Analysis/Auto85.csv"
df = pd.read_csv(path, header = None) # read_csv() assumes the first row is a header by default; header=None prevents that
                              [3]:
                              df.head()
                              [3]:
                              0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
                              0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
                              1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
                              2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
                              3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
                              4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

                              5 rows × 26 columns

                              [4]:
                              headers = ["symboling","normalized-losses","make","fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels","engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "Price"]
                              df.columns = headers
                              [5]:
                              df.head()
                              [5]:
                              symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg Price
                              0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
                              1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
                              2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
                              3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
                              4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

                              5 rows × 26 columns

                              [6]:
                              df.head(7)
                              [6]:
                              symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg Price
                              0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
                              1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
                              2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
                              3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
                              4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
                              5 2 ? audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250
                              6 1 158 audi gas std four sedan fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710

                              7 rows × 26 columns

                              [7]:
                              df.dtypes
                              [7]:
                              symboling              int64
                              normalized-losses     object
                              make                  object
                              fuel-type             object
                              aspiration            object
                              num-of-doors          object
                              body-style            object
                              drive-wheels          object
                              engine-location       object
                              wheel-base           float64
                              length               float64
                              width                float64
                              height               float64
                              curb-weight            int64
                              engine-type           object
                              num-of-cylinders      object
                              engine-size            int64
                              fuel-system           object
                              bore                  object
                              stroke                object
                              compression-ratio    float64
                              horsepower            object
                              peak-rpm              object
                              city-mpg               int64
                              highway-mpg            int64
                              Price                 object
                              dtype: object
                              [8]:
                              df["normalized-losses"]
                              [8]:
                              0        ?
                              1        ?
                              2        ?
                              3      164
                              4      164
                                    ... 
                              200     95
                              201     95
                              202     95
                              203     95
                              204     95
                              Name: normalized-losses, Length: 205, dtype: object
                              [9]:
                              df["normalized-losses"].replace("?", np.nan, inplace = True)
                              [10]:
                              df["normalized-losses"] = pd.to_numeric(df["normalized-losses"])
                              [11]:
                              df.dtypes
                              df["normalized-losses"]
                              [11]:
                              0        NaN
                              1        NaN
                              2        NaN
                              3      164.0
                              4      164.0
                                     ...  
                              200     95.0
                              201     95.0
                              202     95.0
                              203     95.0
                              204     95.0
                              Name: normalized-losses, Length: 205, dtype: float64
                              [12]:

                              mean = df["normalized-losses"].mean()
                              [13]:
                              mean
                              [13]:
                              122.0
                              [14]:
                              df["normalized-losses"].replace(np.nan, mean, inplace=True)
                              [15]:
                              df["normalized-losses"]
                              [15]:
                              0      122.0
                              1      122.0
                              2      122.0
                              3      164.0
                              4      164.0
                                     ...  
                              200     95.0
                              201     95.0
                              202     95.0
                              203     95.0
                              204     95.0
                              Name: normalized-losses, Length: 205, dtype: float64
                              [16]:
                              df["normalized-losses"].dtypes
                              [16]:
                              dtype('float64')
                              [17]:
                              df[["normalized-losses", "make"]]
                              [17]:
                              normalized-losses make
                              0 122.0 alfa-romero
                              1 122.0 alfa-romero
                              2 122.0 alfa-romero
                              3 164.0 audi
                              4 164.0 audi
                              ... ... ...
                              200 95.0 volvo
                              201 95.0 volvo
                              202 95.0 volvo
                              203 95.0 volvo
                              204 95.0 volvo

                              205 rows × 2 columns

                              [19]:
                              df.groupby("make")["normalized-losses"].mean()
                              [19]:
                              make
                              alfa-romero      122.000000
                              audi             144.285714
                              bmw              156.000000
                              chevrolet        100.000000
                              dodge            133.444444
                              honda            103.000000
                              isuzu            122.000000
                              jaguar           129.666667
                              mazda            123.705882
                              mercedes-benz    110.000000
                              mercury          122.000000
                              mitsubishi       140.615385
                              nissan           135.166667
                              peugot           146.818182
                              plymouth         128.000000
                              porsche          134.800000
                              renault          122.000000
                              saab             127.000000
                              subaru            92.250000
                              toyota           110.656250
                              volkswagen       121.500000
                              volvo             91.454545
                              Name: normalized-losses, dtype: float64
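The per-make means above suggest a refinement of the global-mean fill used earlier: impute each missing value with the mean of its own make via `groupby().transform()`. A sketch on toy data (not the dataset's actual values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'make': ['audi', 'audi', 'volvo', 'volvo'],
    'normalized-losses': [160.0, np.nan, 90.0, np.nan],
})

# Fill each gap with the mean of its own make rather than the global mean
group_mean = toy.groupby('make')['normalized-losses'].transform('mean')
toy['normalized-losses'] = toy['normalized-losses'].fillna(group_mean)
print(toy)
```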
                              [20]:
                              df.head()
                              [20]:
                              symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg Price
                              0 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
                              1 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
                              2 1 122.0 alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
                              3 2 164.0 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
                              4 2 164.0 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

                              5 rows × 26 columns

                              [21]:
                              df["wheel-base"].dtypes
                              [21]:
                              dtype('float64')
                              [22]:
                              df["wheel-base"]
                              [22]:
                              0       88.6
                              1       88.6
                              2       94.5
                              3       99.8
                              4       99.4
                                     ...  
                              200    109.1
                              201    109.1
                              202    109.1
                              203    109.1
                              204    109.1
                              Name: wheel-base, Length: 205, dtype: float64
                              [23]:
                              df.tail()
                              [23]:
                              symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg Price
                              200 -1 95.0 volvo gas std four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 9.5 114 5400 23 28 16845
                              201 -1 95.0 volvo gas turbo four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 8.7 160 5300 19 25 19045
                              202 -1 95.0 volvo gas std four sedan rwd front 109.1 ... 173 mpfi 3.58 2.87 8.8 134 5500 18 23 21485
                              203 -1 95.0 volvo diesel turbo four sedan rwd front 109.1 ... 145 idi 3.01 3.40 23.0 106 4800 26 27 22470
                              204 -1 95.0 volvo gas turbo four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 9.5 114 5400 19 25 22625

                              5 rows × 26 columns

                              [24]:
                              df["city-mpg"]
                              [24]:
                              0      21
                              1      21
                              2      19
                              3      24
                              4      18
                                     ..
                              200    23
                              201    19
                              202    18
                              203    26
                              204    19
                              Name: city-mpg, Length: 205, dtype: int64
                              [25]:
                              df["city-mpg"] = 235/df["city-mpg"]
                              [26]:
                              df["city-mpg"]
                              [26]:
                              0      11.190476
                              1      11.190476
                              2      12.368421
                              3       9.791667
                              4      13.055556
                                       ...    
                              200    10.217391
                              201    12.368421
                              202    13.055556
                              203     9.038462
                              204    12.368421
                              Name: city-mpg, Length: 205, dtype: float64
                              [27]:
                              df.rename(columns = {"city-mpg": "c-L/100Km"}, inplace = True)
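The cells above convert fuel consumption from mpg to L/100 km with the standard approximation L/100 km = 235/mpg, then rename the column to match its new unit. A minimal sketch of the same two steps on a small hypothetical frame (the values are made up; the real dataset has 205 rows):

```python
import pandas as pd

# Hypothetical stand-in for the auto dataset's city-mpg column
mini = pd.DataFrame({"city-mpg": [21, 19, 24]})

# L/100km = 235 / mpg, same formula as the cell above
mini["city-mpg"] = 235 / mini["city-mpg"]

# rename without inplace=True, assigning the result back instead
mini = mini.rename(columns={"city-mpg": "c-L/100Km"})
print(mini["c-L/100Km"].round(2).tolist())  # [11.19, 12.37, 9.79]
```

Assigning the result of `rename` back avoids relying on `inplace=True`, which pandas is gradually moving away from.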
                              [28]:
                              df["c-L/100Km"]
                              [28]:
                              0      11.190476
                              1      11.190476
                              2      12.368421
                              3       9.791667
                              4      13.055556
                                       ...    
                              200    10.217391
                              201    12.368421
                              202    13.055556
                              203     9.038462
                              204    12.368421
                              Name: c-L/100Km, Length: 205, dtype: float64
                              [29]:
                              df.head()
                              [29]:
                              symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm c-L/100Km highway-mpg Price
                              0 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 11.190476 27 13495
                              1 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 11.190476 27 16500
                              2 1 122.0 alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 12.368421 26 16500
                              3 2 164.0 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 9.791667 30 13950
                              4 2 164.0 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 13.055556 22 17450

                              5 rows × 26 columns

                              [30]:
                              df["Price"].dtypes
                              [30]:
                              dtype('O')
                              [31]:
                              df["Price"].replace("?", np.nan, inplace = True)
                              df["Price"] = pd.to_numeric(df["Price"])
                              C:\Users\SINDH\AppData\Local\Temp\ipykernel_3076\2782822628.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
                              The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
                              
                              For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
                              
                              
                                df["Price"].replace("?", np.nan, inplace = True)
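The FutureWarning above is triggered because `df["Price"].replace(..., inplace=True)` calls an inplace method on the intermediate Series, which pandas treats as a copy. A warning-free equivalent, sketched on a small hypothetical frame, does the replacement and the numeric conversion in one assignment:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: Price stored as strings with "?" placeholders
df2 = pd.DataFrame({"Price": ["13495", "?", "16500"]})

# Replace the placeholder and convert in a single assignment --
# no chained inplace call, so no FutureWarning
df2["Price"] = pd.to_numeric(df2["Price"].replace("?", np.nan))
print(df2["Price"].dtype)  # float64
```

The missing value survives as NaN, which is why the converted column is `float64` rather than `int64`.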
                              
                              [92]:
                              df["Price"]
                              [92]:
                              0      13495.0
                              1      16500.0
                              2      16500.0
                              3      13950.0
                              4      17450.0
                                      ...   
                              200    16845.0
                              201    19045.0
                              202    21485.0
                              203    22470.0
                              204    22625.0
                              Name: Price, Length: 205, dtype: float64
                              [93]:
                              df["Price"].dtypes
                              [93]:
                              dtype('float64')
                              [34]:
                              df["highway-mpg"] = 235/df["highway-mpg"]
                              df.rename(columns={'highway-mpg': 'h-L/100Km'}, inplace=True)
                              [35]:
                              df["h-L/100Km"]
                              [35]:
                              0       8.703704
                              1       8.703704
                              2       9.038462
                              3       7.833333
                              4      10.681818
                                       ...    
                              200     8.392857
                              201     9.400000
                              202    10.217391
                              203     8.703704
                              204     9.400000
                              Name: h-L/100Km, Length: 205, dtype: float64
                              [36]:
                              df.head()
                              [36]:
                              symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm c-L/100Km h-L/100Km Price
                              0 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 11.190476 8.703704 13495
                              1 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 11.190476 8.703704 16500
                              2 1 122.0 alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 12.368421 9.038462 16500
                              3 2 164.0 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 9.791667 7.833333 13950
                              4 2 164.0 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 13.055556 10.681818 17450

                              5 rows × 26 columns

                              [37]:
                              df["length"]
                              [37]:
                              0      168.8
                              1      168.8
                              2      171.2
                              3      176.6
                              4      176.6
                                     ...  
                              200    188.8
                              201    188.8
                              202    188.8
                              203    188.8
                              204    188.8
                              Name: length, Length: 205, dtype: float64
                              [38]:
                              df["length"] = df["length"]/df["length"].max()
                              [39]:
                              df["length"]
                              [39]:
                              0      0.811148
                              1      0.811148
                              2      0.822681
                              3      0.848630
                              4      0.848630
                                       ...   
                              200    0.907256
                              201    0.907256
                              202    0.907256
                              203    0.907256
                              204    0.907256
                              Name: length, Length: 205, dtype: float64
                              [40]:
                              df["length"].max()
                              [40]:
                              1.0
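Dividing by the column maximum is "simple feature scaling": every value lands in (0, 1] and the maximum maps to exactly 1.0, which is what the check above confirms. A sketch with a few hypothetical lengths (208.1 plays the role of the column maximum):

```python
import pandas as pd

lengths = pd.Series([168.8, 171.2, 208.1])  # hypothetical values; 208.1 is the max
scaled = lengths / lengths.max()
print(scaled.max())  # 1.0
```

Note this method only rescales magnitudes; unlike min-max normalization it does not move the minimum to 0.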
                              [41]:
                              df["width"]
                              [41]:
                              0      64.1
                              1      64.1
                              2      65.5
                              3      66.2
                              4      66.4
                                     ... 
                              200    68.9
                              201    68.8
                              202    68.9
                              203    68.9
                              204    68.9
                              Name: width, Length: 205, dtype: float64
                              [42]:
                              print(df["width"].min() , df["width"].max())
                              60.3 72.3
                              
                              [43]:
                              df["width"] = (df["width"] - df["width"].min()) / (df["width"].max() - df["width"].min())
                              [44]:
                              df["width"]
                              [44]:
                              0      0.316667
                              1      0.316667
                              2      0.433333
                              3      0.491667
                              4      0.508333
                                       ...   
                              200    0.716667
                              201    0.708333
                              202    0.716667
                              203    0.716667
                              204    0.716667
                              Name: width, Length: 205, dtype: float64
                              [45]:
                              df["width"].max()
                              [45]:
                              1.0
                              [46]:
                              df["width"].min()
                              [46]:
                              0.0
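Min-max normalization maps the column linearly onto [0, 1]: the minimum goes to exactly 0 and the maximum to exactly 1, as the two checks above show. A sketch using a hypothetical sample that spans the observed width range (60.3 to 72.3):

```python
import pandas as pd

width = pd.Series([64.1, 60.3, 72.3])  # hypothetical sample spanning the observed range
norm = (width - width.min()) / (width.max() - width.min())
print(norm.round(6).tolist())  # [0.316667, 0.0, 1.0]
```

The first value reproduces the 0.316667 seen for row 0 in the notebook output.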
                              [47]:
                              df["height"] = (df["height"] - df["height"].mean()) / df["height"].std()
                              [48]:
                              print(df["height"], df["height"].min(), df["height"].max())
                              0     -2.015483
                              1     -2.015483
                              2     -0.542200
                              3      0.235366
                              4      0.235366
                                       ...   
                              200    0.726460
                              201    0.726460
                              202    0.726460
                              203    0.726460
                              204    0.726460
                              Name: height, Length: 205, dtype: float64 -2.4247287815509493 2.486215399755926
                              
                              [49]:
                              df["height"].min()
                              [49]:
                              -2.4247287815509493
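Z-score standardization recenters the column at mean 0 with sample standard deviation 1; the min and max printed above are simply how many standard deviations the extreme heights sit from the mean. A sketch with hypothetical heights:

```python
import pandas as pd

height = pd.Series([48.8, 52.4, 54.3, 55.5])  # hypothetical heights
z = (height - height.mean()) / height.std()   # Series.std() uses ddof=1 (sample std)
# After standardization: mean ~ 0, std ~ 1 (up to floating-point error)
print(round(z.mean(), 6), round(z.std(), 6))
```

Unlike min-max scaling, z-scores are unbounded, which is why the standardized heights range from about -2.42 to 2.49 rather than 0 to 1.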
                              [55]:
                              df["Price"].dtypes
                              [55]:
                              dtype('O')
                              [51]:
                              df["Price"].max()
                              [51]:
                              '?'
                              [52]:
                              df["Price"]
                              [52]:
                              0      13495
                              1      16500
                              2      16500
                              3      13950
                              4      17450
                                     ...  
                              200    16845
                              201    19045
                              202    21485
                              203    22470
                              204    22625
                              Name: Price, Length: 205, dtype: object
                              [53]:
                              df["width"]
                              [53]:
                              0      0.316667
                              1      0.316667
                              2      0.433333
                              3      0.491667
                              4      0.508333
                                       ...   
                              200    0.716667
                              201    0.708333
                              202    0.716667
                              203    0.716667
                              204    0.716667
                              Name: width, Length: 205, dtype: float64
                              [54]:
                              bins = np.linspace(min(df["Price"]), max(df["Price"]), 4)
                              ---------------------------------------------------------------------------
                              UFuncTypeError                            Traceback (most recent call last)
                              Cell In[54], line 1
                              ----> 1 bins = np.linspace(min(df["Price"]), max(df["Price"]), 4)
                              
                              File ~\AppData\Roaming\Python\Python312\site-packages\numpy\core\function_base.py:129, in linspace(start, stop, num, endpoint, retstep, dtype, axis)
                                  125 div = (num - 1) if endpoint else num
                                  127 # Convert float/complex array scalars to float, gh-3504
                                  128 # and make sure one can use variables that have an __array_interface__, gh-6634
                              --> 129 start = asanyarray(start) * 1.0
                                  130 stop  = asanyarray(stop)  * 1.0
                                  132 dt = result_type(start, stop, float(num))
                              
                              UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U5'), dtype('float64')) -> None
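The traceback occurs because, in this kernel session, `Price` is still object dtype: `min`/`max` return the string `"?"`, which `np.linspace` cannot multiply by 1.0. Converting the column to numeric first makes the equal-width bin edges well-defined. A sketch with hypothetical prices:

```python
import numpy as np
import pandas as pd

price = pd.Series(["13495", "?", "22625"])        # hypothetical object-dtype column
price = pd.to_numeric(price.replace("?", np.nan))  # "?" -> NaN, dtype float64
bins = np.linspace(price.min(), price.max(), 4)    # NaN is skipped by Series.min/max
print(bins.round(2).tolist())  # [13495.0, 16538.33, 19581.67, 22625.0]
```

`np.linspace(lo, hi, 4)` produces 4 edges, i.e. 3 equal-width bins, matching the three `group_names` used below.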
                              [87]:
                              bins
                              [87]:
                              array([0.        , 0.33333333, 0.66666667, 1.        ])
                              [89]:
                              group_names = ["Low", "Medium", "High"]
                              [90]:
                              df["width-binned"] = pd.cut(df["width"], bins, labels = group_names, include_lowest = True)
                              [91]:
                              df[["width" , "width-binned"]]
                              [91]:
                              width width-binned
                              0 0.316667 Low
                              1 0.316667 Low
                              2 0.433333 Medium
                              3 0.491667 Medium
                              4 0.508333 Medium
                              ... ... ...
                              200 0.716667 High
                              201 0.708333 High
                              202 0.716667 High
                              203 0.716667 High
                              204 0.716667 High

                              205 rows × 2 columns
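`pd.cut` assigns each value to one of the three equal-width intervals; `include_lowest=True` closes the first interval on the left so the column minimum (here 0.0) is not dropped as out-of-range. A sketch with the same bins and labels on hypothetical normalized widths:

```python
import numpy as np
import pandas as pd

width = pd.Series([0.316667, 0.0, 0.433333, 1.0])  # hypothetical normalized widths
bins = np.linspace(0, 1, 4)                        # edges [0, 1/3, 2/3, 1]
labels = ["Low", "Medium", "High"]
binned = pd.cut(width, bins, labels=labels, include_lowest=True)
print(binned.tolist())  # ['Low', 'Low', 'Medium', 'High']
```

Without `include_lowest=True` the 0.0 row would come back as NaN, since the default intervals are open on the left.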

                              [93]:
                              pd.get_dummies(df["fuel-type"])
                              [93]:
                              diesel gas
                              0 False True
                              1 False True
                              2 False True
                              3 False True
                              4 False True
                              ... ... ...
                              200 False True
                              201 False True
                              202 False True
                              203 True False
                              204 False True

                              205 rows × 2 columns
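`pd.get_dummies` one-hot encodes the categorical column: one indicator column per category, with exactly one True per row. A sketch on a hypothetical fuel-type sample (in pandas 2.x the indicators are boolean, as in the output above; older versions return 0/1 integers):

```python
import pandas as pd

fuel = pd.Series(["gas", "gas", "diesel"])  # hypothetical fuel types
dummies = pd.get_dummies(fuel)

print(dummies.columns.tolist())              # ['diesel', 'gas']
print(dummies["gas"].astype(bool).tolist())  # [True, True, False]
```

The resulting columns can be joined back with `pd.concat([df, dummies], axis=1)` so models that need numeric input can use the fuel type.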

                              [ ]:

                                    Create a new notebook
                                  • Reconnect to Kernel
                                  • Render All Markdown Cells
                                  • Restart Kernel and Clear Outputs of All Cells…
                                    Restart the kernel and clear all outputs of all cells
                                  • Restart Kernel and Debug…
                                  • Restart Kernel and Run All Cells…
                                    Restart the kernel and run all cells
                                  • Restart Kernel and Run up to Selected Cell…
                                  • Restart Kernel…
                                    Restart the kernel
                                  • Run All Above Selected Cell
                                  • Run All Cells
                                    Run all cells
                                  • Run Selected Cell and All Below
                                  • Save and Export Notebook: Asciidoc
                                  • Save and Export Notebook: Executable Script
                                  • Save and Export Notebook: HTML
                                  • Save and Export Notebook: LaTeX
                                  • Save and Export Notebook: Markdown
                                  • Save and Export Notebook: PDF
                                  • Save and Export Notebook: Qtpdf
                                  • Save and Export Notebook: Qtpng
                                  • Save and Export Notebook: ReStructured Text
                                  • Save and Export Notebook: Reveal.js Slides
                                  • Save and Export Notebook: Webpdf
                                  • Select All Cells
                                    Ctrl+A
                                  • Show Line Numbers
                                  • Toggle Collapse Notebook Heading
                                  • Trust Notebook
                                  • Plugin Manager
                                  • Advanced Plugin Manager
                                  • Settings
                                  • Advanced Settings Editor
                                  • Settings Editor
                                  • Show Contextual Help
                                  • Show Contextual Help
                                    Live updating code documentation from the active kernel
                                    Ctrl+I
                                  • Terminal
                                  • Decrease Terminal Font Size
                                  • Increase Terminal Font Size
                                  • New Terminal
                                    Start a new terminal session
                                  • Refresh Terminal
                                    Refresh the current terminal session
                                  • Use Terminal Theme: Dark
                                    Set the terminal theme
                                  • Use Terminal Theme: Inherit
                                    Set the terminal theme
                                  • Use Terminal Theme: Light
                                    Set the terminal theme
                                  • Text Editor
                                  • Decrease Font Size
                                  • Increase Font Size
                                  • New Markdown File
                                    Create a new markdown file
                                  • New Python File
                                    Create a new Python file
                                  • New Text File
                                    Create a new text file
                                  • Spaces: 1
                                  • Spaces: 2
                                  • Spaces: 4
                                  • Spaces: 8
                                  • Theme
                                  • Decrease Code Font Size
                                  • Decrease Content Font Size
                                  • Decrease UI Font Size
                                  • Increase Code Font Size
                                  • Increase Content Font Size
                                  • Increase UI Font Size
                                  • Set Preferred Dark Theme: JupyterLab Dark
                                  • Set Preferred Dark Theme: JupyterLab Light
                                  • Set Preferred Light Theme: JupyterLab Dark
                                  • Set Preferred Light Theme: JupyterLab Light
                                  • Synchronize Styling Theme with System Settings
                                  • Theme Scrollbars
                                  • Use Theme: JupyterLab Dark
                                  • Use Theme: JupyterLab Light